PREFACE


Elementary Statistics and Applications is designed for a beginning
course. Principles of gathering and presenting statistics,
frequency-distribution analysis, probability theory and the
normal curve, correlation, time-series analysis, and forecasting
are included. Elementary sampling procedure, only so far as
it is founded upon the assumption of normal sampling
distributions, is also included.

No attempt has been made to include any of the less conventional
methods of time-series analysis. Some are too mathematical
for treatment in an elementary text. Others are so
highly specialized or so subjective as to be unsuited for textbook 
material. Many of these are new methods that need to be 
further systematized, coordinated, and tested in the crucible of 
time and experience. 

The approach in this book is that of the teacher. The authors 
have been associated in teaching statistics for more than ten 
years. The manuscript of the present text evolved during those 
years in mimeographed form, modified from year to year as new 
theories developed and as teaching use required. The suggestions
of students, whether consciously or unconsciously made,
have helped formulate this book. Experience has shown that 
students gain a sense of the close association of statistics to 
reality from the brief discussions of the historical origin of important
steps in the development of statistical theory that are
included. 

The descriptions of frequency-distribution, correlation, and 
time-series analysis are first completed in their simplest aspects, 
with elementary illustrations. This enables the student to 
visualize basic method unmixed with the more advanced phases. 
More complex illustrations of practical application are then given 
in separate chapters or separate sections. This practice eliminates
the apparent digression that seems to hamper the student
when exposition of method and complicated illustrations are
intermixed, as in the conventional text. In separating the two,
moreover, the fact is recognized that the best order of presentation
for teaching is not the best order of procedure for working
an actual problem. For example, the handiest method for 
making a frequency-distribution analysis is to set up a work sheet 
with a Charlier check and first calculate the moments or the k 
statistics; but the theories of the moments and of the k statistics 
are among the most difficult parts of the analysis to explain and 
are not therefore good introductory topics for the teaching of 
frequency-distribution analysis. In addition, the practical 
analysis of the frequency distribution introduces short cuts, cross 
checks, or other timesaving devices. The authors believe that 
this new arrangement will also prove to be a boon to research 
workers who may use the text as a reference book. 

The more advanced points of statistical theory pertaining to 
frequency curves and sampling analysis have been placed in a 
separate book entitled Sampling Statistics and Applications. The
two books together constitute a set on the subject of Fundamentals
of the Theory of Statistics.

In both volumes, the authors have drawn freely upon the 
many monographs and the periodical literature that have 
appeared during recent years. Care has been exercised to 
make acknowledgment in footnotes to the sources of new ideas 
that have been incorporated into the authors' own development 
of the subject. To all these vigorous workers in the field, too 
numerous to be listed by name, the authors as well as other 
statisticians are greatly indebted. 

More particularly the authors here acknowledge a debt of 
gratitude to several generous professional colleagues who have 
read parts of the manuscript with critical and judicious eye. 
Sidney W. Wilcox, Chief Statistician of the Bureau of Labor 
Statistics in the United States Department of Labor, made 
especially helpful suggestions for Chap. XIX, Index Numbers, 
for the chapters on probability theory, and for Parts I and II 
of Elementary Statistics and Applications. John H. Smith of 
the Bureau of Labor Statistics contributed many stimulating
criticisms and suggestions that the authors believe inspired
important improvements. Lester S. Kellogg, Bureau of Labor
Statistics, read Chap. III, Sources of Statistics, and made suggestions
that led to a constructive reworking of that material.
The authors are profoundly grateful for such generous assistance
and wish to make full acknowledgment of their professional
indebtedness to these men.

The authors are grateful to the International Finance Section 
of Princeton University for the financial assistance given Acheson 
J. Duncan some years ago to enable him to study statistics and 
mathematical economics with the late Henry Schultz of the 
University of Chicago and with Harold Hotelling of Columbia 
University. The authors are indebted to those men, and to 
colleagues in the Mathematics Department at Princeton University.
The authors are also indebted to Professor R. A.
Fisher, also to Messrs. Oliver & Boyd, Ltd., of Edinburgh, for
permission to reprint an abridged edition of Table III, Table 
of from their book Statistical Methods for Research Workers. 

Naturally, it is not to be supposed that the whole or any part 
of the manuscript carries the endorsement of the authors' former
teachers or those who have helped with criticisms of the manuscript.
The authors assume full responsibility for errors of
theory or calculation that may be present in the volumes. 

James G. Smith. 

Acheson J. Duncan. 

Princeton, N. J., 

August, 1944.
PART I
Introduction 


CHAPTER I 

STATISTICS IN THE ARTS AND SCIENCES 

From the sixteenth century to the present day, modern sciences 
have stressed empirical method, the gathering of data by laboratory
experiment or by statistical observation. Laboratory
experimentation has been more spectacularly employed in the 
natural sciences (biology, chemistry, physics, botany, and the 
like), and statistical observation has been more widely used in 
the social sciences (such as politics, economics, and psychology). 
Yet laboratory technique is used for some types of investigation 
in the social sciences, especially in psychology, education, and 
agriculture; and statistical technique is frequently employed 
in the natural sciences; for example, the modern kinetic theory
of gases is a statistical argument. 

Economy and Flexibility of Statistics. Meaning of Statistics.
Statistics and scientific method are of value wherever a mass of 
complicated facts exists and wherever those facts are amenable to 
quantitative expression. Qualitative knowledge must be converted
into quantitative units of enumeration or of measurement
before it becomes statistics. The quantitative units are either 
enumerative or measurement units. An enumerative unit 
depends upon proper definition of the objects to be counted; 
thus statistics may be compiled on the number of blue-eyed 
as compared with brown-eyed people, the number of yellow as 
compared with green beans, etc. A measurement unit depends 
upon contrivance of some unit of measurement for the purpose 
of converting qualitative knowledge into quantitative expression; 
thus properly devised tests make it possible to measure intellectual
aptitude on a scale so that certain quantity figures can
be depended upon to measure relative amounts of intellectual
aptitude. 

Such quantitative description of facts makes it possible to 
give in a brief space a great amount of information. In order 
to accomplish this economy of time and space, however, it is 
of the greatest importance that the units of measurement or of 
enumeration be uniformly applied and that the nature of these 
units of measurement for observation or of enumeration be constantly
kept in mind when the data are used. Furthermore,
having always in mind the nature of the statistical units chosen 
as criteria of measurement, it is possible to arrange statistical 
data in such a manner as greatly to facilitate their interpretation. 

A large degree of flexibility is thus available when facts are
expressed quantitatively; and, so long as the original units of 
measurement are not obscured, it is possible for specific purposes 
to arrange and rearrange a given set of data. A part of this 
flexibility is due to the fact that otherwise long, time-consuming
methods of analysis can be resolved into relatively simple
mathematical operations. These short cuts and the savings of 
human effort they make possible in the search for truth are only 
possible where knowledge can be expressed quantitatively, 
which is to say, by statistics. In using these short-cut methods, 
however, it is necessary to be ever watchful for hidden inconsistencies
with the original units of measurement, for it is in this
realm that many of the misuses of statistics are found. 

Economy and a high degree of flexibility are characteristics of 
statistics that well fit them to serve a dynamic society's needs for 
analysis and formulation of policy. It is a lesson learned from sad 
but profitable experience that statistics are something more than 
the mere will to collect facts in quantitative form. Careful study 
by many scholars has given rise to rules of procedure that must be 
followed if the economy and flexibility of which statistics are 
capable are to be realized. These rules of procedure constitute 
the science of statistics, to several aspects of which attention 
should be directed for differentiation. “Statistics” is used
broadly to refer to the whole field of the quantitative approach to 
knowledge, including the gathering of data, problems of statistical 
measurement, statistical analysis, statistical theory, and scientific 
method in general. The word “statistics” is also used to refer to 
any one of these parts of the whole subject. 
Accordingly, while “statistics” is used in the broad sense
indicated, it applies also more particularly and more accurately 
to compiled data that are systematic and quantitative expressions 
of facts or events. 

The theory of statistics is also called “statistics.” The theory 
of statistics is the body of principles that has been developed, 
partly a priori by the mathematical approach and partly by 
empirical methods, to serve as a guide for sound statistics and 
sound statistical method. Understanding of the theory of statistics
is required also for compiling statistics. Statistical
theory is required because nearly all compilations of quantitative 
facts are samples and not complete enumerations and because the 
fundamental rules regarding units of measurement must be 
obeyed in statistical enumeration if the resulting data are to be 
homogeneous, that is, comparable one with another.

“Statistics” also refers to statistical method, a term used to 
describe the process of interpreting facts by the use of statistics 
and statistical theory. Careful study of the assembled statistical
data, obtained in a manner to secure internal comparability
and arranged in well-planned tables, may be used as a
basis for judgments or action. Further quantitative treatment,
however, may frequently give greater significance to the statistics.
Selected summaries may bring out many relationships
that would be difficult to visualize if they were in tables of figures 
that had been compiled for general purposes. This additional 
quantitative treatment is of the nature of classification and 
summarization. It is called “statistical analysis” and includes 
the methods of tabulation, graphs, averages, measures of variability,
correlation, index numbers, and similar quantitative
analyses that have been developed. Judgments based on
statistical analysis are called “statistical inferences.” Statistical
method, then, consists of two parts, (1) statistical analysis
and (2) statistical inferences. 

In recent years the word “statistics” has also been used to 
describe figures that have been obtained by statistical analysis; 
for example, arithmetic means, average deviations, measures of 
correlation, and the like, are all called “statistics,” and any one 
of them alone is called a “statistic.”

The word “statistics” is thus used to mean all these various 
things together and any one of them separately. This may make 
for confusion, and in the above discussion such usage makes it
appear as if terms were defined by use of the term defined; but
such is established conventional, albeit confused, use of this all-inclusive
word “statistics.”

THREE TYPES OF DATA 

Empirical vs. Experimental Data. Answering the accusation 
that their conclusions are so vague and unpredictable as to preclude
scientific sanction, the social scientists have often pleaded
that social studies cannot, like the theories of the natural sciences, 
be tested in the laboratory. The social sciences must rely only 
on statistics and empirical or historical methods. Social theories 
can be interpreted with respect to true life only if viewed in the 
light of a ceteris paribus assumption. The assumption that 
other things are equal, or unchanged, or in balance serves the 
social scientist in the same manner as controls over experimental 
conditions serve the natural scientists. 

With the development, on the one hand, of statistical methods 
in the natural sciences and the development, on the other hand, of 
experimental methods in the social sciences, this contrast is 
becoming less real. While it is still true that social science 
predominantly uses empirical or historical data, some important 
work has been done, and more important work appears in the 
offing, with experimental data in the fields of psychology, 
sociology, education, medicine, population studies, agricultural 
economics, and statistical control of quality of manufactured 
products. Such outstanding progress in the technical development
of this experimental work has been made as to constitute
almost a special field called “design of experiments.”¹

Design of Experiments. The arrangements for making the 
experiment and for recording the data therefrom constitute the 
design of the experiment. In designing an experiment, methods 
of so controlling the experiment as to prevent biased results must 
be devised. If, for example, the experiment is to test the effects 
upon cotton culture of a certain kind of fertilizer, several areas in 
various localities may be chosen in order to test the effect of 
the selected fertilizer under a number of climatic conditions. 
The design for the experiment must then plan also some means of
measuring these various other influences, namely, temperature
and rainfall. Some method must also be devised for discovering,
in the resulting data, not only how much of the productivity of
cotton is due to the fertilizer, but also how much is due to the
differing qualities of the soil, to varying amounts of rainfall, and to
varying levels of temperature. The design for the experiment
must plan and organize the procedure so that from the resulting
data it will in truth be possible to measure the net influence of the
new fertilizer.

¹ Fisher, R. A., The Design of Experiments (1935).

Where cost is a consideration, and it seldom is not, an important
part of the design of experiment is to decide to what extent
to experiment, in other words, how small an experiment will give 
trustworthy results. Before doing this it must be decided how 
much precision in the results, for practical purposes, is required. 

The solution of some of the problems relating to design of 
experiment may be found by applying the theory of statistics. 
The solution to others is a matter of common sense, which sometimes
is more difficult to apply than might be supposed.
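
As an illustration of how statistical theory bears on the question of how small an experiment will give trustworthy results, the following is a minimal sketch and not a method taken from this text. It assumes a normal sampling distribution for the mean and a standard deviation known in advance, say from earlier trials. The function name `required_sample_size` and the figures used are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest number of plots n for which the half-width of a
    normal-theory confidence interval for the mean, z * sigma / sqrt(n),
    does not exceed `margin`.

    sigma      -- assumed (hypothetical) standard deviation of yield per plot
    margin     -- precision required for practical purposes, in the same units
    confidence -- confidence level of the interval
    """
    if sigma <= 0 or margin <= 0:
        raise ValueError("sigma and margin must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)


if __name__ == "__main__":
    # Hypothetical figures: yields vary with a standard deviation of 40 lb
    # per plot, and the mean is wanted to within 10 lb or within 5 lb.
    for margin in (10.0, 5.0):
        print(margin, required_sample_size(sigma=40.0, margin=margin))
```

Halving the margin roughly quadruples the number of plots, which is why the required precision has to be settled before the size of the experiment is fixed.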

Not only in such a case as testing the use of fertilizer, but in 
many problems, the researcher finds that a number of factors 
influence a given result. In agricultural phenomena, weather, 
climate, and other natural and human factors are present; in 
medicine, age, sex, and other conditions affect the application of 
treatment; in biochemical and in psychological experimentation, 
many human and natural variables enter. When it is necessary 
for a given purpose of analysis to isolate one of several influences, 
the data can be so selected or the treatments so applied as to hold 
other influences constant. For example, if age and sex as well as 
inoculation affect the outcome of pneumonia cases, the inoculation
can be tested by comparing inoculated and noninoculated for
those of the same sex or age group. It has become the practice
to call the noninoculated group the “control” in the experiment.¹
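
The idea of holding other influences constant can be sketched in a few lines. This is only an illustration under stated assumptions: the records are invented, the field layout (sex, age group, inoculated, recovered) and the function name `recovery_rates_by_stratum` are hypothetical, and no test of significance is attempted.

```python
from collections import defaultdict

# Invented records: (sex, age_group, inoculated, recovered).
CASES = [
    ("M", "under 40", True, True),
    ("M", "under 40", True, True),
    ("M", "under 40", False, True),
    ("M", "under 40", False, False),
    ("F", "40 and over", True, True),
    ("F", "40 and over", True, False),
    ("F", "40 and over", False, False),
    ("F", "40 and over", False, False),
    ("F", "under 40", True, True),  # no control case in this group
]


def recovery_rates_by_stratum(cases):
    """Return {(sex, age_group): (inoculated_rate, control_rate)}.

    Only groups that contain both inoculated and control cases are
    reported, so each comparison is made within the same sex and age group.
    """
    counts = defaultdict(lambda: [0, 0])  # (stratum, inoculated) -> [recovered, total]
    for sex, age_group, inoculated, recovered in cases:
        cell = counts[((sex, age_group), inoculated)]
        cell[0] += int(recovered)
        cell[1] += 1

    rates = {}
    for stratum in sorted({key[0] for key in counts}):
        if (stratum, True) in counts and (stratum, False) in counts:
            treated = counts[(stratum, True)]
            control = counts[(stratum, False)]
            rates[stratum] = (treated[0] / treated[1], control[0] / control[1])
    return rates


if __name__ == "__main__":
    for stratum, (treated, control) in recovery_rates_by_stratum(CASES).items():
        print(stratum, "inoculated:", treated, "control:", control)
```

Comparing rates inside each group, instead of pooling all cases, keeps a difference in the age or sex make-up of the two groups from being mistaken for an effect of the inoculation.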

Hypothetico-observational Data. In addition to empirical and
experimental data scientists make extensive use of a third type,
namely, hypothetico-observational data.² For example, in the
physical sciences, that the moon is about 240,000 miles from the
earth is a hypothetico-observational datum: no one has carried
out the experiment of measuring the distance from the earth to
the moon; yet on the basis of certain hypotheses it is measured to
a comparatively high degree of precision. In the social sciences,
index numbers purporting to measure such items as the general
level of prices are hypothetico-observational data. In both
illustrations, upon the basis of certain hypotheses or theories,
practical methods are devised for estimating the measurement in
question. In appraising the resulting estimate, the precision of
the underlying theory or hypothesis is of primary importance.

¹ Cf. Hill, A. Bradford, Principles of Medical Statistics (1939), pp. 4-8
and 170-178.

² Eddington, Sir Arthur, The Philosophy of Physical Science (1939),
pp. 12-14.

SERVING THE ARTS AND SCIENCES 

Statistics and the Social Sciences and Arts. Politics. Public
opinion, the opinion of the masses, can be ascertained at any time
on a wide variety of social and political issues by means of
statistical data collected by random questionnaires from a comparatively
small number of people. The employment of statistical
technique for this purpose has stirred the imagination
and stimulated the ingenuity of students of the social and
governmental processes. The widespread demand for such
information and the relatively low cost of obtaining it by the
sampling method have also gratified the acquisitiveness and 
lined the purses of a number of enterprising polling agents. 
Increasingly, political strategists appear to pay attention to these 
systematic statistical studies of public opinion. Both the major 
political parties in the United States have had expert statisticians 
engaged during the quadrennial presidential campaigns to keep 
their fingers on the pulse of public opinion. 

It has been claimed that “sampling referenda make the mass
articulate, define the mandate of our leaders, reveal the true
popular strength of pressure groups, and show social taboos
quantitatively for what they are worth. . . .” that they are, in
the language of journalism, “the fourth dimension for the Fourth
Estate.”¹

¹ Gallup, George, “Government and Sampling Referendum,” Journal
of the American Statistical Association, Vol. 33 (1938), pp. 131-142.

Governmental Administration. Statistics are extensively used
as guides to various kinds of governmental administration, such
as sanitation, hospitalization, highway supervision, and public
industrial accident and compensation insurance laws. For example,
on the assumption that industrial accidents are due to unsafe
conditions and unsafe practices that, if eliminated, would
prevent repetition of the same or similar types of accident,
statistical data on causes of accident have been assembled.
Study of these data enables the statistician to identify and select
the unsafe elements in transportation conditions and then to
present the data to safety engineers for guidance in accident
prevention.¹

From its beginning in 1790 to the present day, the Federal 
government has considered statistics on foreign trade so important
that an organization has been maintained for the express
purpose of assembling such statistics. In the early years of the
republic they were gathered by the Treasury Department, but
now they are collected and published regularly by the Department
of Commerce. With the rapid development of large-scale
business organization in the latter half of the nineteenth century,
public policy with respect to social and economic conditions has
required the Federal government to maintain a Bureau of Labor
Statistics which has been engaged principally in the task of collecting
and publishing statistics on prices, cost of living, and
wages and, in more recent years, on employment and pay rolls in 
manufacturing industries of the United States. 

It is a matter of common knowledge to all who read newspapers 
that important laws are passed by city, state, and Federal governments
on the basis of statistical facts assembled regularly
or collected by special legislative committees. For example, the
Federal Reserve System of banks in the United States was created
in 1913 after a thorough study, involving extensive use of statistics,
of the banking situation in this and other countries;
legislation in the decade of the 1930's on public works, relief work,
and social security was largely based on studies of a statistical 
nature. 

¹ Kossoris, M. D., “A Statistical Approach to Accident Prevention,”
Journal of the American Statistical Association, Vol. 34 (1939), p. 526.

Business Administration. Statistics are valuable in business
administration, enabling the manufacturing executive to obtain
more or less satisfactory answers to such a perplexing question as:
Making allowance for seasonal changes and expected prices of
substitute goods, what will be consumer demand for the coming
year? Some manufacturers must make estimates for a year in
advance; others can proceed successfully with monthly estimates.
Retail-store executives frequently require weekly or even daily
estimates on some articles, while sellers of perishable vegetables or
other foods may even have to make hourly estimates.

When the manufacturing executive has a satisfactory answer 
to the above type of question, he can schedule production to 
maintain as nearly level a rate as is feasible and to keep as 
constant a labor force as possible. In some large business 
enterprises statistics are assembled daily on working capital 
position, factory expense, output, and consumer credit extended. 
Control by the executive is kept flexible and timely by a continuous
stream of statistics both on the internal state of the
business and on external economic conditions. As one rather
erudite businessman says, “There has been an insistence from the
very top of the organization on getting the facts, so that we might,
to apply Descartes’s picturesque phrase, ‘be clear about our
actions and walk surefootedly in this life.’”¹

In his determination of policies regarding prices, production, 
and employment in his own business, the enterpriser must 
make judgments based upon knowledge of the world of prices in 
which he lives. Prices he must pay for raw materials, for labor, 
for equipment and its upkeep are his guide for determining his 
own production activity and the price he can eventually obtain 
for his product. Since all or at least part of the system of prices, 
that is, the prices he pays and the prices consumers pay for 
competitive or substitute articles, is beyond his control, the 
individual producer adapts his plans to any uncontrollable conditions
he finds in the market. It is by the use of statistics that the
modern businessman comes to understand conditions to which, if
he is to profit, he must succeed in adapting his own business. 

¹ Hayford, F. L., “Some Uses of Statistics in Executive Control,”
Journal of the American Statistical Association, Vol. 31 (1936), pp. 31-37.

During recent years polling agencies have been hired by business
executives to obtain certain types of information with
respect to potential markets and changes in consumer tastes or
habits. Student groups and student publications on the campuses
of colleges and universities are employed by businessmen to
make widespread use of polling techniques. It has also been
found that a carefully conducted student poll can do more
to make college administrators and trustees cognizant of student
attitudes toward vital campus issues than the older and less
effective means of circulating petitions. In the large university
the student poll performs many of the functions of the open
forum in a small university or college. Similarly, merchants and
classes in advertising can determine the efficacy of advertising
by the extent to which students express a preference for branded
and highly advertised cigarettes, toilet articles, school supplies,
and items of clothing to the little or nonadvertised varieties.
The radio programs to which students listen, the magazines to
which they subscribe, the amounts they spend for various
budgetary items, the type of motion picture they most enjoy,
the mileage they travel, and the means of transportation they
prefer are typical items of information eagerly sought by advertising
organizations and business firms in college and other kinds
of markets as well.¹

In a wide variety of practical ways the statistical principle 
of sampling is used in business. For example, by the use of a 
small spectroscope, an entire trainload of pig iron can be tested. 
The spectroscopist opens the car door, fastens a wire to a sample 
pig, strikes an electric arc between this and a bar of pure iron he 
carries, and observes the light in the spectroscope. The bands of 
color in the spectroscope reveal to him whether or not the amount 
of impurity in the pig is below a previously determined standard. 
By properly selecting sample pigs at random the trainload of
metal can be tested before it is unloaded.² In a similar manner,
though perhaps with less sensational instruments than the
spectroscope, other types of more or less homogeneous or standardized
goods, such as shipments of ores, grains, oranges, potatoes,
or lettuce, can be tested by sampling. 
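
The procedure just described, drawing sample pigs at random and comparing each with a previously determined standard, may be sketched as follows. This is a hedged illustration only: the impurity figures, the standard of 0.05, the sample size, and the function name `inspect_shipment` are all assumptions made for the example, and the rule "accept only if every sampled pig is below the standard" is one simple choice among many.

```python
import random


def inspect_shipment(impurities, sample_size, max_impurity, seed=None):
    """Draw `sample_size` pigs at random, without replacement, and test each.

    impurities   -- impurity reading for every pig in the shipment
                    (unknown in practice until a pig is actually tested)
    max_impurity -- the previously determined standard; a pig passes only
                    if its impurity is below this figure
    Returns (accepted, sampled_positions, failing_positions).
    """
    if not 0 < sample_size <= len(impurities):
        raise ValueError("sample_size must be between 1 and the shipment size")
    rng = random.Random(seed)
    # Every pig has the same chance of being chosen for testing.
    sampled = sorted(rng.sample(range(len(impurities)), sample_size))
    failing = [i for i in sampled if impurities[i] >= max_impurity]
    return len(failing) == 0, sampled, failing


if __name__ == "__main__":
    # Hypothetical trainload of 200 pigs with impurity fractions near 0.03.
    generator = random.Random(7)
    trainload = [generator.uniform(0.02, 0.04) for _ in range(200)]
    accepted, sampled, failing = inspect_shipment(trainload, 10, 0.05, seed=7)
    print("accepted:", accepted, "tested pigs:", sampled)
```

Only the sampled pigs are examined, so the verdict on the whole trainload rests on the assumption that the shipment is reasonably homogeneous, as the text notes for standardized goods.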

¹ For further illustrations see, for example, W. B. Dygert, Radio as an
Advertising Medium; H. E. Agnew and W. B. Dygert, Advertising Media;
E. R. Walter, Effective Marketing; E. H. Schell and F. F. Gilmore, Manual
for Executives and Foremen; and H. B. Maynard and G. J. Stegemerten,
Operation Analysis.

² Harrison, G. R., Atoms in Action (1939), p. 165. Cf., on sampling for
grading cars of iron ore, Stewart H. Holbrook, Iron Brew (1939), pp. 164-165.

Education. The grading and selection of teachers have in some
instances been based upon intelligence tests, which have been
perfected by the use of statistical technique correlating test
grades with empirical results.³ The scientific use of intelligence
tests has developed since the First World War. In 1917 a test
called the American Army Intelligence Test was given to the
drafted soldiers. The set of questions included on the Army
Intelligence Test was based upon the cumulative experience,
comparatively limited in extent, with such tests up to that time.
The war experience with the tests proved to be a landmark in
their development in that it constituted a major experiment
in their use and stimulated rapid development in the principles of
their use.¹ Subsequently, the art of constructing questions for
testing intelligence, now called “aptitude” in order to contrast the
testing of natural ability with the mere testing of acquired ability,
has greatly progressed. In addition to the college-entrance tests,
which in part measure opportunity, scholastic-aptitude tests are
used by the leading universities as a basis for selecting students.
As a consequence, statistical data that measure not only acquired
intelligence but also native ability, or aptitude, are being accumulated.
The aptitude-test rating is often called the “intelligence
quotient,” or simply I.Q.

³ West, Michael, “The Psychology of the Teacher,” Journal of Education,
March, 1939, p. 138.

¹ Brigham, Carl C., “Two Studies in Mental Tests,” Psychological
Review, Psychological Monographs, Vol. 24 (1917); A Study of American
Intelligence (1923); A Study of Error (1932).

Mental tests have most frequently been employed with the
feeble-minded in connection with problems of detection and placement
and for determining the type of training best suited to
individual persons. Studies of criminals by the use of intelligence
tests disclose relationships between intelligence and the type
of crime committed, but apparently a high I.Q. neither prevents
nor stimulates crime in general. Delinquent children have been
found to exhibit more neurotic traits than do unselected school
children. Tests of emotional control, dishonesty, and lack of
self-control have been found useful in forecasting incorrigibility
among delinquent children.

Recently a study was made in which the I.Q.’s of 214 foster
children, all of whom were adopted before the age of twelve
months, and of 105 control children living with their own parents
were compared with the I.Q.’s of the foster and real parents. The
I.Q. of the parents was supplemented by information on occupational
status and other pertinent data. Information regarding
the true parents of adopted children was secured from placement
records. There was far greater correspondence between the
I.Q.’s of foster children and their true parents than between
the I.Q.’s of foster children and their foster parents. It was
estimated by statistical techniques that the contribution of
heredity to individual differences in I.Q. is probably not far from
70 to 80 per cent and that the very best environment might, however,
raise I.Q. as much as 20 points, while the poorest environment
might lower it as much as 20 points.¹

Sociology. Modern sociology employs statistical method
almost to the exclusion of other methods. This may be a mistaken
emphasis that will be corrected by future sociologists,
but in that discipline the twentieth-century reaction to nineteenth-century
abstraction was particularly great. Moreover,
this extreme emphasis upon fact by American sociologists² can
be traced to the picturesque Lester Frank Ward, who, despite
the abstract qualities of his writing, emphasized the statistical
approach. A farmer, a Civil War soldier, a Federal government
official, a lawyer, a botanist, a chief of the Division of Navigation
and Immigration, and, finally, toward the end of his life, a professor
of sociology at Brown University, Ward came to the study
of sociology with a richly varied experience. Among the voices
raised against nineteenth-century emphasis on nature and the
neglect of humanity his was the most vigorous. So eager was the
reading world for this new approach that some of Ward's books
were translated into every Continental tongue.³

Economic Theory. From Adam Smith to the present time 
economic theory has been, at least in part, an inductive science. 
In Adam Smith’s day there were few statistics, but he made
extensive use of trade, price, and wage data in his analyses. In
modern times, especially since the turn of the century, more 

1 Study by Miss Burks, described in H. E. Garrett and M. R. Schneck,
Psychological Tests, Methods, and Results (1933), pp. 189-190.

2 Lynd, R. S., and H. M. Lynd, Middletown: A Study in Contemporary
American Culture (1929); Middletown in Transition: A Study in Cultural 
Conflicts (1937). These remarkable books are modern classics in sociology 
and are based almost entirely upon observational method largely statistical 
in character. 

3 Chugerman, Samuel, Lester F. Ward, The American Aristotle (1939).
Because of Ward's optimistic views it has been suggested lately that he should
be widely read both for information and encouragement. Cf. a review of
Chugerman's book by Prof. Rudolph Binder in The New York Times Book
Review, Oct. 15, 1939, p. 10.

complete statistical data are available, and an increasingly
important volume of statistical writings that have significance
in economic theory has been forthcoming. The theory of allocation
of income to the owners of capital, to the laborers, and to the
enterprisers, as well as other comprehensive economic hypotheses
such as business-cycle theories and theories of money and prices,
are today being tested by careful statistical studies.¹ In addition,
theoretical questions concerning the factors determining particular
prices are being studied by the use of statistical methods.

Mathematical Economics. In recent years the subject of 
statistics has become closely related to the mathematical approach 
to economic theory. Starting with the nineteenth-century work 
of Cournot a group of mathematical economists have attempted to 
work out a purely abstract theory of economics by using the 
shorter and more precise methods of mathematical reasoning. 
The origin of this school of economists, often called the “mathematical
school of economists,” was independent of the development
of statistics. As statistical methods became more refined
and economic data more plentiful and more accurate, the mathematical
school of economists turned to statistics to derive demand
curves and supply curves from the actual statistical events of the
market place. This development during the 1930’s was one of
the most sensational and also one of the most controversial 
contributions to economics, and it continues to be energetically 
discussed in scientific meetings and journal articles by proponents 
and opponents of the methods used. Meanwhile, without waiting 
for the issue to be settled by the theoreticians, business enterprise 
and government, and notably the United States Department of 
Agriculture, are making extensive practical use of statistical 
demand and cost curves.²

Statistics and the Natural Sciences and Arts. Astronomy. 
One of the foundations of statistical theory, the method of least 

1 Cf. National Bureau of Economic Research, Studies in Income and
Wealth, Vols. 1-3. National Resources Committee, The Structure of the
American Economy (1939), Part I, Basic Characteristics. Tinbergen, J.,
A Method and Its Application to Investment Activity (1939); Business Cycles
in the United States of America, 1919-1932 (League of Nations Economic
Intelligence Service, 1939).

2 Ezekiel, M., and L. H. Bean, Economic Bases for Agricultural Adjustment
Act (1933); Schultz, Henry, Theory and Measurement of Demand
(1938).

squares, was first discovered and applied in astronomy early in the
nineteenth century. The method continues to be employed in 
astronomy to trace the paths of stars, comets, planets, and other 
heavenly bodies. Modern astronomy deals with large numbers 
of observations, which become the statistical raw material for the 
science. For example, the Harvard College Observatory receives 
monthly, from nearly one hundred different observers distributed
the world over, and on report blanks containing seven
to seven hundred observations each, an average of forty-five
hundred observations. It has been found best not to attempt to
analyze each observer’s work separately, but instead to depend on
multiplicity and frequency of observations well distributed
throughout, to obtain the best possible light curves. Over fifty
thousand observations come to the Harvard College Observatory
each year, and from 1911 to 1939 it collected three-quarters of a
million observations.¹

For years the Smithsonian Institution has been using methods 
essentially statistical in nature to record measurements of the 
amount of heat received from the sun by the earth. Smithsonian 
stations in three of the most arid regions of the earth are daily 
recording the sun’s radiation. Observers in Chile, in South 
Africa, and in Western United States have been taking records. 
According to these observations, which have been made at widely 
separated stations, correlations exist between changes in solar 
radiation and temperatures on the earth. Study of these records, 
study of records of the earth’s weather as recorded in the growth 
rings of trees, and study of similar phenomena have revealed 
recurrent cycles in the weather that may be of great value in 
foretelling long-range trends in the future succession of fat and 
lean years.²

Zoology. A considerable amount of the experimental work in 
the life sciences involves such quantitative considerations as 
weights, measurements, enumerations, pointer readings of various 
kinds, comparisons, and classifications. If the results arrived at 
by experimentation are to give rise to general principles rather 

1 Cf. Campbell, Leon, “The Light Curve of SS Cygni, 213843,” Annals
of Harvard College Observatory, Vol. 90, No. 3, pp. 93-162; Sterne, T. E., and
Leon Campbell, “Properties of the Light Curve of SS Cygni,” ibid., Vol. 90,
No. 6, pp. 189-206.

2 So says G. R. Harrison, op. cit., pp. 290-291.

than just to meaningless and incoherent single observations, the
zoological data must be consistently assembled, uniformity of
units must be observed, and the data classified. In other words,
statistical method must be used to bring order from isolated
chaotic measurements.

In addition to routine problems of analysis in zoology, statistical
and mathematical devices have had interesting applications
in certain special problems. For example, in 1934 Zeuner
used a statistical study of a system of cranial angles as a basis for
biological inferences regarding rhinoceroses; in 1930 Soergel
emphasized the importance of statistical methods for certain
paleontological problems, employing numerical and mathematical
procedures to study footprints and from these drawing
inferences regarding the animals that made them; and in 1912
Ridgway attempted to put the study of faunal coloration on a
statistical basis.¹ Paleontologists use various mathematical
and graphic means to restore missing parts in fossil animals and
to reconstruct hypothetical intermediate stages between less and
more specialized animals. They also use statistical methods
to study averages and variation in characteristics of different
age groups, rate of growth, and the like, of various animals.²

Biology. Considering the modern emphasis on statistics in the 
social sciences it is interesting to note, not only that the method of 
least squares was first applied in a natural science, but also that a 
second highly important statistical method was first developed in 
the natural science of biology. This is the statistical measurement
of correlation, which in the 1870’s was used by Sir Francis
Galton to measure the effect that characteristics of midparents
(that is, the average of the two parents) had on their children.³

Biological experimentation in the nineteenth and twentieth 
centuries, involving as it does rats, guinea pigs, and the like,
makes use of procedures that combine the laboratory test with 
the assembling of statistical data and their subsequent analysis. 
In this way, the effects and incidence of various diseases and 

1 Soergel, W., Die Bedeutung variationsstatistischer Untersuchungen für
die Säugetier-Paläontologie, Band 63, pp. 349-450; Ridgway, R., Color
Standards and Color Nomenclature (1912). Also see Simpson, G. G., and
Anne Roe, Quantitative Zoology (1939), pp. 24, 404-406.

2 Simpson and Roe, op. cit., p. 335.

3 See Chap. XIII.

of various cures for those diseases are measured; thus also are
tested the various theories regarding the relative importance
of hereditary factors as compared with environmental factors in 
animal life. Much of this experimental work later becomes the 
basis for theories regarding human life and for theories in respect 
to the effects of human diseases and their cure. 

Some problems in biology have interesting applications to
the homely arts of living. A recent illustration of the use of
statistics in biology is the standardizing of liquid household
insecticides, a matter of considerable importance to certain
private enterprisers engaged in the business.¹ By a series of
experiments that established the sex ratio of houseflies statistically,
hitherto unknown sources of variability in the effects
of insecticides were thrown into bold relief. It was found,
for example, that flies at ages of less than three days vary considerably
in their reaction to the spray, while flies four to six
days old exhibit a fairly constant susceptibility. It was known 
that male houseflies are markedly more susceptible to certain 
sprays than female houseflies. 

A recent book on heredity² illustrates the extent to which
biology depends upon statistical technique. Widespread interest
in the Dionnes led biologists to calculate the probability of
quintuplets as compared with the probability of twins. The
probability of quintuplets is 1/41,600,000, while that of twins
is [...]. In addition, the statistical method was used in an interesting
way to answer the question of heredity vs. environment,
epitomized in the highly talented musical family of Johann
Sebastian Bach, a talent that ran through five generations.
Were the Bachs musical because of inborn talent or because
of the musical environment in the home? To answer this
question the author of the above-mentioned book resorted to
statistical technique. He obtained information from 36 outstanding
instrumental musicians, from 36 principals of the
Metropolitan Opera Company, and from 50 students of the 
Juilliard Graduate School of Music. From facts obtained by 

1 Campbell, F. L., G. W. Snedecor, and W. A. Simonton, “Biostatistical
Problems Involved in the Standardization of Liquid Household Insecticides,”
Journal of the American Statistical Association, Vol. 34 (1939),
pp. 62-80.

2 Scheinfeld, A., You and Heredity (1939).

questioning these persons, the author concludes that their
talent is largely inherited. Many will welcome this trend toward
basing studies of man upon statistics of human beings rather
than upon statistics of vegetables or fruit flies.

Medicine. Much of the statistical work in biology has
application in the field of medicine, and interest in statistics
on the part of the medical profession has increased. In addition,
the medical profession has become interested in statistics on
economic and social welfare, factors of importance in the control
of epidemics and of certain types of disease in the modern
community.¹ The practical advantages to the physician and to the
sanitarian of the development of medical statistics are very
great. Matters that were fiercely debated two generations ago
and concerning which only few physicians of a hundred years ago
could form an opinion are now a regular part of the knowledge
of a junior medical student through the study of mortality
statistics and vital statistics.² Indeed, the medical profession
in England has recently contributed a textbook on medical
statistics designed to acquaint medical students with the fundamentals
of statistical theory.³

Engineering. Since the success of their work depends not
only on the machines but on the human beings who operate them, 
mechanical engineers have become increasingly interested in the 
use of statistical method for making time studies in machine 
operations. It is now realized that such studies cannot be 
safely based upon some a priori scale of the machine's capacity 
or upon the record of only one or two operatives. Rather, 
time-study data must be collected from an entire group of 
operatives so that adjustments can be made according to the 
effects upon operation of the human traits found by statistical 
study to prevail in the machine or in the manner of operation.⁴

Two simple examples of the application of statistics to electrical 
engineering are the study of elevator capacity for buildings and 

1 Cf. Davis, Michael M., “Wanted: Research in the Economic and Social
Aspects of Medicine,” The Milbank Memorial Fund Quarterly, October, 1935,
pp. 339-346.

2 Cf. Pearl, Raymond, Introduction to Medical Biometry and Statistics
(1923), pp. 2, 38.

3 Hill, A. B., Principles of Medical Statistics (1939).

4 Bergen, H. B., “Scientific Management in Unionized Plants,” Mechanical
Engineering, March, 1938, pp. 235-240.

telephone calls to be handled by an exchange. Statistics
regarding the number of passengers taken on at the first floor
are used to determine the time required for passengers to leave
the elevator, the round-trip time, and the number of passengers
carried by a given elevator. The most desirable type of elevator
equipment to install is determined from such data.¹

Since engineers are dealing with natural phenomena that 
cannot be affected by human bias, many of their problems can 
be solved approximately by the application of the principles 
of probability. For example, during a long period of gaugings of 
a stream the frequency of floods is often the best indication 
of probable future floods. Such important engineering data 
as forecasts of future floods, low annual rainfall, and consequent 
depletion of storage reservoir can be estimated by applying 
the theory of probabilities to statistics on the past history of 
such events. From such data, the use of statistical technique 
makes it possible to estimate the proper size of a hydroelectric 
power plant and to predict its output and earnings.²
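
The reasoning of the preceding paragraph can be illustrated by a short Python sketch. It is offered only as an illustration under stated assumptions: the gauging figures are invented, the function names are hypothetical, and the years are treated as independent of one another, which the text does not assert. The relative frequency of flood years in the past record serves as the estimate of the probability of a flood in any one future year.

```python
# Illustrative sketch only: the gauging figures below are invented,
# and treating successive years as independent is an assumption.

def flood_probability(annual_peaks, flood_stage):
    """Relative frequency of years whose peak reached the flood stage."""
    flood_years = sum(1 for peak in annual_peaks if peak >= flood_stage)
    return flood_years / len(annual_peaks)


def chance_of_flood_within(p_annual, years):
    """Probability of at least one flood in the given number of years,
    assuming each year is independent with the same annual probability."""
    return 1.0 - (1.0 - p_annual) ** years


# Hypothetical record: peak stage (feet) in each of 20 years of gaugings.
peaks = [11.2, 9.8, 14.1, 10.5, 8.9, 12.7, 15.3, 9.1, 10.0, 11.8,
         13.9, 9.4, 10.9, 16.2, 8.7, 11.1, 12.2, 9.9, 10.3, 14.6]

p = flood_probability(peaks, flood_stage=14.0)
print(f"Estimated annual flood probability: {p:.2f}")
print(f"Chance of at least one flood in 10 years: "
      f"{chance_of_flood_within(p, 10):.2f}")
```

The same two steps, a frequency taken from the record and a probability projected forward, would apply to years of low rainfall or of reservoir depletion; only the definition of the event changes.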

One of the most striking illustrations of the use of statistics 
in engineering is the control of the quality of manufactured 
products.³ In ordinary manufacture, with the exception of
the making of optical or other precision instruments of infinite 
refinement, all units of a product are not identical, in spite of 
the vaunted standardization of products in industry. The cost 
of so refining the machines or of so regulating their operation as 
to make all units of product identical would be prohibitive 
and in most cases unwarranted because of the low market value 
of the product. Variations in quality are thus considered to be 
justified, and it is the purpose of quality control to develop 

1 Cook, H. B., “Selecting Elevators for an Office Building,” Power, Mar. 8,
1932, pp. 404-408.

2 Creager, W. P., and J. D. Justin, Hydro-electric Handbook (1927),
pp. 43 and 171. For other illustrations of the use of statistics in engineering
science and art, see C. W. Hubbard, “Investigation of Errors of Pitot
Tubes,” Transactions of the American Society of Mechanical Engineers,
August, 1939, pp. 477-506; H. K. Barrows, Water Power Engineering (1927),
pp. 54-57.

3 Shewhart, W. A., Economic Control of Quality of Manufactured Product
(1931). Since Shewhart's pioneer efforts on this important subject, much
progress has been made, so much that one might say a new craft has been
created.

statistical means of showing the actual statistics of variations
in quality, the economically permissible variations in quality,
and the statistical measurement of ways of locating and correcting
causes of quality variation beyond set limits. Such
control is designed to reduce the number of products that 
must be discarded as below standard; consequently, if successful, 
quality control reduces waste and lowers manufacturing costs 
per unit of output. In addition, selling costs are reduced and 
good will improved, because quality control decreases the 
number of customers who become dissatisfied as the result of the 
inconvenient necessity of returning inferior products. 
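
A minimal sketch in Python may make the idea of "set limits" concrete. It assumes limits placed three standard deviations on either side of the average of a baseline sample, a convention commonly associated with control charts but not specified in the text. The measurements and function names are hypothetical.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from a baseline sample.
    k = 3 standard deviations is an assumed convention, not the text's."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return center - k * spread, center, center + k * spread


def out_of_control(values, lower, upper):
    """Indices of observations falling outside the set limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical diameters (mm) measured while the process ran normally.
baseline = [10.02, 9.98, 10.01, 9.99, 10.00, 10.03, 9.97, 10.01, 9.99, 10.00]
lower, center, upper = control_limits(baseline)

# Hypothetical later production; the fourth unit drifts beyond the limits.
later = [10.01, 9.99, 10.02, 10.15, 10.00]
print(f"limits: {lower:.3f} to {upper:.3f}")
print("units to investigate:", out_of_control(later, lower, upper))
```

Units inside the limits are accepted as ordinary, economically permissible variation; a unit outside them is the signal to look for an assignable cause in the machine or its operation.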

Although used in the American Telephone and Telegraph 
Company under the leadership of Dr. Shewhart, application 
of statistical quality control has been negligible in the United 
States.¹ In Great Britain, however, the idea of statistical
quality control was accorded an enthusiastic reception following
Shewhart’s visit to London in 1932. A committee headed by
Dr. E. S. Pearson was organized by interested British industrialists
to consolidate previous progress and facilitate adoption
of the technique.² By 1937 in England the methods had been
applied to coal, coke, cotton yarns, cotton textiles, woolen
textiles, spectacle glass, lamps, building materials, and manufactured
chemicals.³

Physics and Chemistry. There is no dispute among modern 
physicists and chemists as to the importance of statistical 
methods in their sciences. Even the highly metaphysical Sir 
Arthur Stanley Eddington in his Nature of the Physical World 
(1928) attaches great importance to statistics in the natural 

1 Two reasons have been given for this failure of statistical quality control 
to be applied in the United States: “[first,] a deep-seated conviction of
American production engineers that their principal function is so to improve
technical methods that no important quality variations remain, and that in
any case the laws of chance have no proper place among modern ‘scientific’
production methods; second, . . . the difficulty of obtaining industrial
statisticians who are adequately trained in this fairly complicated field.”
Freeman, H. A., “Statistical Methods for Quality Control,” Mechanical
Engineering, Vol. 59 (1937), pp. 261-262.

2 Pearson, E. S., The Application of Statistical Methods to Industrial
Standardization and Quality Control (British Standards Institution, London,
1935).

3 Freeman, op. cit., pp. 261-262.

sciences. In fact, he says that the laws of nature divide themselves
into three classes: (1) identical laws, such as the law of
conservation and the law of gravitation; (2) statistical laws,
such as Boyle’s law, the second law of thermodynamics, and
quantum laws; and (3) transcendental laws, which are “genuine
laws of control in the physical world.”¹

In physics, statistical technique is employed in the study of
molecules. This modern statistical approach has a philosophical
background that goes back at least as far as Boltzmann, who
in 1866 expressed the second law of thermodynamics in terms of
probabilities. His contribution was regarded as a form of
mysticism until it was demonstrated by research during the
first two decades of the twentieth century.² At the turn of the
twentieth century Max Planck was trying to explain why pieces
of matter heated to high temperatures emit more light of one
wave length than of any other and less light at both larger and
shorter wave lengths. He could not explain this phenomenon
except by supposing that light is emitted by atoms not as continuous
trains of electromagnetic waves but in discrete bundles
of energy that he called “quanta.” Similar experimental
work accompanied by new theoretical contributions, notably
those of Heisenberg, led to the formulation of the modern
statistical approach to the natural sciences. Within three
decades this new theory has come into widespread practical
use also, having found application in explanation of the behavior
of photographic plates, the conduction of electricity through
wires, the conduction of heat through walls, the behavior of
photoelectric cells, the manner of emission and absorption of
light by atoms and molecules, and in the theory of metals.³

As explained by a recent popularizer of the natural sciences,⁴
Newtonian mechanics succeeds in accurately predicting motion 

1 Stebbing, L. S., Philosophy and the Physicists (1937), p. 70.

2 Haas, Arthur, The New Physics (1923), pp. 38-44.

3 Cf. Eldridge, J. A., The Physical Basis of Things (1934), pp. 357-358.

4 De Broglie, Louis, Matter and Light: The New Physics, translated by
W. H. Johnson (1939). For a popularized description of the experimental
development based upon Boltzmann and later Heisenberg’s theories, see
also Eddington, op. cit. Cf. William M. Malisoff, review of De Broglie’s
Matter and Light, in The New York Times Book Review, Oct. 1, 1939. Also
see H. Lifschutz and O. S. Duffendack, “The Counting Losses in Geiger-Müller
Counter Circuits and Recorders,” Physical Review, Vol. 54 (Nov. 1,

occurring on the human scale and also on the scale of celestial
bodies. In other words, Newtonian mechanics does well for 
macroscopic measurements. But in the investigations of the 
motion of the microscopic particles inside the atom, Newtonian 
mechanics ceases to have value, while quantum mechanics makes 
it possible to grasp the meaning of new principles that must 
necessarily be introduced in these more minute analyses. The 
principles referred to are statistical in nature and are based 
in large part on the theory of probability. “It is impossible
to measure several physical quantities (as energy, position,
momentum) accurately at the same time. It is this necessary
inexactness that has forced us to find our ultimate laws in
probabilities.”¹

It must not be supposed that the new statistical approach, 
which is said to have been derived from Heisenberg’s uncertainty 
principle, necessarily has thrown into chaos concepts of physical
measurement. The admission that laws in quantum mechanics
are statistical may destroy the idea that the universe is a huge
machine; but in a given case, with the initial conditions determined
as precisely as the principle of uncertainty permits, the
probability of all subsequent states is determined by exact
mathematical probabilities. There is nothing lawless in quantum
phenomena.² Analysis shows, moreover, that the theoretical
uncertainty, which prohibits a simultaneous accurate measurement
of position and of velocity, is noticeable only in dealing with
the very minute masses of the subatomic world. With ordinary
masses, the theoretical uncertainty, though still existing, falls
below the practical uncertainties, which are due to the imperfection
of human observations, and is completely submerged by the
latter. This gradual obliteration of the quantum uncertainties, 
as the scientific observer passes to the commonplace level of 
average masses, is the reason why Newtonian mechanics is still 
used. For the small velocities and relatively large masses with 


1938), pp. 714-725; A. Ruark, “The Time Distribution of So-called Random
Events,” Physical Review, Vol. 56 (Dec. 1, 1939), pp. 1165-1167; E.
Rutherford, Radiations from Radioactive Substances (1930), Chap. VII,
pp. 171-172.

1 Eldridge, op. cit., p. 376. Cf. Tolman, R. C., The Principles of Statistical
Mechanics (1938), p. 65.

2 Stebbing, op. cit., p. 183.

which the scientist is usually concerned, the two mechanics yield
results so nearly alike that in practice no experiment would be
sufficiently refined to detect the difference. Since the Newtonian
mechanics is mathematically the simpler, there is every advantage
in retaining it.¹
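
The contrast between subatomic and ordinary masses can be made concrete with a brief Python sketch. It assumes the uncertainty relation in a form commonly given today, that the product of the uncertainties of position and momentum is at least ħ/2; the text itself states no formula. The masses and position uncertainties are chosen purely for illustration, and the function name is hypothetical.

```python
# Illustrative sketch: lower bound on velocity uncertainty from
# delta_x * delta_p >= hbar / 2 (a standard form of the uncertainty
# relation, assumed here; the text gives no formula).

HBAR = 1.054571817e-34  # reduced Planck constant, joule-seconds


def min_velocity_uncertainty(mass_kg, delta_x_m):
    """Smallest velocity uncertainty (m/s) allowed for a body of the
    given mass whose position is known to within delta_x_m metres."""
    return HBAR / (2.0 * mass_kg * delta_x_m)


# Electron confined to roughly atomic dimensions (illustrative values).
electron = min_velocity_uncertainty(9.1093837e-31, 1e-10)

# One-gram body located to within a thousandth of a millimetre.
gram = min_velocity_uncertainty(1e-3, 1e-6)

print(f"electron:      {electron:.3e} m/s")
print(f"one-gram body: {gram:.3e} m/s")
```

Under these assumed values the bound for the electron is of the order of hundreds of kilometres per second, while that for the one-gram body lies immeasurably far below anything an instrument could detect, which is the sense in which the theoretical uncertainty is submerged by the practical one.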

Except for Einstein’s theory of relativity there has been nothing
so to stir the imagination of the natural scientists in the twentieth
century as this new statistical approach. In fact, one writer has
said that the entire structure of modern physics and chemistry,
and therefore of all the natural sciences to which they are fundamental,
rests upon quantum mechanics.²

From the above discussion it is readily apparent that statistical 
techniques are helpful, not only to theories in the natural and 
social sciences, but to the arts dependent on those sciences. Yet 
for many students the most important reason for knowing something
about the fundamentals of statistical method is the need for
intelligent discrimination between the proper and improper use of
statistics. Unfortunately, a large portion of the extensive
modern employment of statistics in all fields falls under the latter
heading. This is especially true in popular presentations of
modern scientific and political matters. Too close attention to
the mechanics of a method and the neglect of common sense are 
responsible for a large number of these horrible examples. All too 
often, preoccupation with the technique dims common sense. 

Statistics and Philosophy. Nineteenth-century cocksureness 
of the scientific approach, pretending to such a degree of precision 
and to such broad scope as to annihilate the foundations for 
ethical, moral, and religious faiths, has largely disappeared. 
Under the aegis of the assertive and materialistic science of the 
nineteenth century, belief in free will was dwindling to a mere 
superstition; but the element of indeterminacy brought into 
science as a result of the application of the theory of probabilities 
again permits freedom. This decline of mechanistic assurance in 
science has not been ignored by philosophic thought, which has 
emphasized as never before a lesson that has often recurred in the 
history of philosophy: objective reality is not always identical 
with subjective concepts.³ Eddington expresses these doubts

1 D’Abro, A., The Decline of Mechanism (1939), pp. 37-57.

2 Harrison, op. cit., pp. 341-342.

3 Eldridge, op. cit., pp. 379-380.

in the following words: “Does the spectroscope find the colors,
or does it make them? When the late Lord Rutherford showed
us the atomic nucleus, did he find it or did he make it? How
much do we discover and how much do we manufacture by our
experiments?”¹

Just as surely as the railroad destroyed the supremacy of the
stagecoach or the radio eclipsed the popularity of the phonograph,
so have the discoveries of modern science eclipsed faith in many
ideals and beliefs that served to give reason to the lives of the
masses of the people. In the realm of ethical and moral values,
buttressed by the dogma of a bygone age, nineteenth-century
scientific method was almost wholly destructive and hardly at all
constructive. Modern philosophy criticized scientific method,
both the laboratory and statistical branches, for failing to provide
new moral values to replace outmoded prescientific ones. Despite
this gloomy aspect, philosophy's greatest spokesmen look to
scientific method itself to obtain the necessary enlargement
of the conception of human nature and the formulation of the
required new moral values. John Dewey envisages the use of
scientific method to create a comprehensive democratic culture as
a guarantor of genuine freedom.²

SUMMARY 

Statistical method, the quantitative expression of knowledge,
the marshaling of facts and their arrangement in a form suitable
for scrutiny, has been the means employed by businessmen,
natural scientists, and social scientists to establish bases for
judgments regarding factual data so complex or so numerous as
to be, in the unmarshaled state, intellectually incomprehensible.
Commercial statistics and their interpretation may, indeed, be
said to constitute the scientific background of business today.
Men cannot conduct their business intelligently without them.
Quite as important as statistics of commerce and trade are the 
more recently developed industrial and social statistics, data on 
employment and pay rolls in industry, trade, and finance and on 
the distribution of income. 

In the science of government and its practical art the statistical
approach has proved itself essential as facts have accumulated

1 Op. cit., pp. 108-109.

2 Freedom and Culture (1939), passim.

and, to an increasing extent, as means have been developed
for making the quantitative units of measurement required.
The importance of statistical techniques in the natural sciences is
attested to by the definition of science so familiar to every
schoolboy: Science is systematized knowledge. Statistical
mechanics is essential to an understanding of modern physics
and chemistry. Whatever the individual’s station or calling, he
is to a greater or lesser extent using statistical techniques. 



CHAPTER II 


GATHERING STATISTICS 

Before the commercial revolution of the sixteenth century, 
social and economic life was relatively simple. The small villages 
and towns were self-sufficing economic and social units. Little or 
no statistical enumeration of facts was required to comprehend 
the extent of the population, the number of buildings, the number 
of cattle, and the quantity of other constituent units in the community.
Within the limited range of space and time usually
contemplated, events having to do with the welfare or distress of
the community were not complex. Judged by modern standards,
government was simple and inexpensive because social and
economic relationships were not complicated. Even the great
cities of the time were not large compared with modern metropolitan
districts. In population, wealth, and trade, the extent
of a sixteenth-century nation was inconsiderable and, furthermore, 
was growing almost imperceptibly. In other words, conditions 
were relatively simple and static. 

Genesis of Fact Marshaling. Under such conditions, little 
was done in the way of the systematic gathering and analysis 
of statistical data. The situation did not demand the continuous
assembling of up-to-the-minute facts. Indeed, it was
not profitable to do so. The motive did not exist in sufficient
force to direct attention to the problem of expressing quantitatively
the events of contemporary social and economic life;
and the facts of the natural sciences were obscured in medieval
mysticism or cherished from a forgotten age by a few scattered
and scholarly churchmen. Nevertheless, it was found useful
on occasions to make great surveys that could subsequently
serve as the basis of governmental decision in regard to taxation
and other social activities, and that might also be a guide to
private enterprise. Pepin the Short in 758 and Charlemagne
in 762 demanded detailed descriptions of church lands, while
several works written in France during the first half of the ninth
century gave a partial enumeration of the serfs attached to the 
land.¹ Likewise, when William the Conqueror undertook the
reorganization of the national government of England in the
eleventh century, he found it desirable to make his famous
survey, which resulted in Domesday Book, completed about
1086.² Also, as early as the fourteenth century, the medieval
guilds gathered statistics in connection with their regulation of
markets.³ Later, in the fifteenth century, as the breakup of
the medieval system gathered momentum and as the rise of
trading groups accelerated, there was a great increase in the
amount of statistical work done by guilds as well as by central
governments, the latter not infrequently through guild organizations
or through the Church. Economic statistics were collected
when the occasion demanded, for example, when the upsetting 
of a customary price by a flood or drought required explanation 
and the determination of a new customary price. Although 
there is evidence that in these several ways statistics were 
assembled, they were neither methodically made nor preserved. 
There are isolated instances of the registration of deaths or 
baptisms in the fourteenth and fifteenth centuries, but it was 
not until the sixteenth century that any considerable movement 
toward statistical enumeration of facts occurred. 

Development of a Dynamic Social Order. During the Renais¬ 
sance, from thirteenth-century Italy to fifteenth-century Spain 
and England, the quantity of data in the physical sciences 
steadily accumulated from experimental efforts of astronomers 
and other scientists. The most dramatic of all human experi¬ 
ments was made by the voyagers seeking to prove that the world 
is round. The discovery of America and the voyages of exploration
of the sixteenth century gave great impetus to the development
of trade and the growth of nations.⁴ Motivated by the
economic ideals of mercantilism, a period of trade development 
followed, the domestic system of manufacture rapidly expanded, 

¹ Walker, Helen M., Studies in the History of Statistical Method (1929),
pp. 32-33. The History of Statistics (compiled and edited by John Koren,
1918).

² Cheyney, Edward P., An Introduction to the Industrial and Social
History of England (1925), pp. 17-18.

³ Faure, Fernand, “The Development and Progress of Statistics in
France,” in Koren, History of Statistics, pp. 22^233.

⁴ Faulkner, H. U., American Economic History (1931), pp. 34-57.

and the colonial empires of Portugal, Spain, France, Holland,
and then England, emerged. The change was from haphazard 
and occasional trade of the merchant adventurers of the sixteenth 
century to the more or less systematic and regular international 
and intercolonial trade of the seventeenth and eighteenth cen¬ 
turies. Along with this trade development came the necessity 
for obtaining more regular information concerning markets, 
population, wealth, prices, and the movements of merchandise 
and gold. Furthermore, with this growth of trade both in 
volume and in complexity, governmental and social organization 
became more complex. 

As the fact of change was revealed by the events of the com¬ 
mercial revolution, national governments began to feel the 
need of more regular fact finding in order to visualize and to 
interpret changing conditions. Yet it must not be supposed 
that well-organized or any considerable amount of statistical 
data for the sixteenth or even the seventeenth century can now 
be found. It was more a case of an awakening of the will to do 
rather than a case of actual accomplishment. For it was really 
the Industrial Revolution and the vigorous growth that took
place in the eighteenth and nineteenth centuries that gave the 
actual impetus to systematic marshaling of quantitative facts. 
It was not until the early part of the nineteenth century, indeed, 
that most of the essential principles of statistical method, even 
for purely descriptive purposes, had been evolved. Also, the 
compilation and current use of statistics as practiced today 
have been made possible only by the growth in transportation 
and communication facilities, a nineteenth-century phenomenon. 
It was also from the eighteenth century onward that the achieve¬ 
ments of scientists in accumulating experimental data for the 
natural sciences fired the imagination of scholars to solve the 
problem of data accumulation for the sciences of life and social 
behavior. 

Quantitative Expression of Facts. Where there are large 
populations, great nations of tens of millions of people, all 
problems of social, economic, and political organization are 
increased many times in complexity and, furthermore, new 
problems arise. The problem of feeding such large populations, 
the problem of housing them, the problem of keeping them 
employed and preventing them from harming each other, to
mention but a few of the considerations confronting the governmental
administrator—all these are vastly complex owing to the
great expanse of geographical space covered and the varying 
conditions at different places and times. In simple economies 
many of these problems can be solved by permitting individual
freedom of choice and free economic enterprise; but as the 
community becomes more and more closely knit in economic 
and social relations and as various forms of economic power 
emerge, individual freedom of choice and free economic enterprise
become goals that must be consciously sought by organization
rather than natural tendencies that develop unaided.¹

In the intricate social and economic organizations of the 
modern era, it is inconceivable that any individual or group of 
individuals can obtain the knowledge necessary to form judg¬ 
ments concerning the issues that arise. An individual can 
comprehend only those conditions within a reasonable geo¬ 
graphical area about him; the more complicated society is, the 
smaller the area about him that he can understand without the 
use of statistics. “We are overwhelmed, not only by the diversity
of knowledge, but also by the diversity of possible deeds, of
possible values, and of possible judgments,” and, further, “this
human mind, whose needs Plato so perfectly understood, still
insists upon constructing for itself a fixed world in the midst
of a fluid one. It persists in thinking in terms of aims and
ends and perfections; of ideals, of purposes, and final goods;
and, at the very last, it insists upon assuming some direction
in change, something toward which the chain of events is
moving.”²

In this effort it is impossible for the individual to survey the 
conditions qualitatively—it would take him many human life¬ 
times to inspect the whole population, and the capacity of the 
human brain is not adequate to the task of absorbing so complex 
an impression. If he attempts a microscopic survey, he is 
quickly smothered by overwhelming detail. If he attempts a 
macroscopic survey without the use of statistics, he is compelled 

¹ For the complexity of modern society, as it is reflected in statistics, see
publications of the United States Bureau of the Census. For suggestive
special studies, see Corrington Gill, Wasted Manpower (1939); Henry Pratt
Fairchild, People (1939).

² Krutch, J. W., Art and Experience (1932), pp. 121, 211.

to resort to guesswork and commonly originates “cloud-pushing”
fantasies. Furthermore, the individual's personality tends
to bias him not only in his observations but also in his judgments. 
If he is temperamentally inclined to be impressed by sordid 
things, he is likely to notice them more than the good things in 
his surroundings, and his judgment is correspondingly pessi¬ 
mistic. On the other hand, if he is temperamentally optimistic, 
he tends to consider as the rule the good things and to regard 
as unusual the sordid things of life. Where it is necessary to 
gain knowledge concerning large populations of people and 
things, where social and economic life is complex, there is need 
to use statistics. 

Rational Basis for Gathering Data. Quixotically accumulating
data is not to be confused with scientific fact gathering. The
progressive accumulation of useful quantitative facts has been
stimulated and furthered by a definite, conscious purpose. 
To look at the process historically, it was the rise of nationalism 
and the mercantilistic ideal that supplied definite purpose for 
the fact-finding inquiries of the eighteenth-century political 
arithmeticians. Modern survivals of the same nationalistic 
and mercantilistic ideals impel governments to spend vast 
sums for the collection of statistics designed to measure the 
wealth and material position of the nation and to furnish business 
enterprise with facts about markets. Underlying much of this
effort abides also a sincere interest, stimulated by scientific 
research, in real human welfare. As a consequence, the modern 
census attempts to collect quantitative facts directly or indirectly 
concerning the health and morals of the nation's people. The 
subsequent usefulness of such statistical data depends upon 
how well the simple rules of common sense have been followed 
in assembling and in presenting the data. 

Units of Description and Measurement. The units of description
or measurement by means of which quantitative facts
are to be assembled must be carefully defined. When defined, 
such units must be strictly applied during the assembling of the 
data and in all subsequent analysis. It is accordingly of the 
utmost importance clearly and fully to describe units of descrip¬ 
tion and measurement in all subsequent use of the data. Such 
rules are so clear and so easily resolved into matters of simple 
common sense that it seems almost a waste of time to direct
attention to them; yet to follow them is not always so simple a
matter as might be supposed. 

For example, in 1940 thousands of enumerators undertook 
the task of counting the population of the United States, of 
counting the number of farms, farm animals, and all other types 
of wealth, and of obtaining specified information concerning 
every person living in the United States. One may ask: Why 
should mere counting be a complicated task? This question 
would be quickly answered by the farmer’s boy who has just 
finished trying to count the number of chickens in his pens. 
Everything would be easy if they would only stand still. People, 
as well as chickens, do not stand still while they are being 
counted, and simple matters mount up into a veritable host of 
intricate difficulties. Suppose you were an enumerator and 
in the first house approached you found that the mother is in 
the maternity hospital, a baby was born at 10 a.m. of the census 
day, one son is away at college in another state, a daughter is 
boarding and rooming in a neighboring town, where she teaches 
school, and the father is in jail for evading income taxes. On 
several points you would feel the need for very specific instruc¬ 
tions. To avoid double counting or the failure to count many 
individuals, instructions to the enumerators must be given with 
great care; every possible complication must be foreseen.

In recording facts about manufacturing and trading, or 
merchandising, enterprises in separate categories, when is an 
enterprise a manufacturing concern and when is it a trading, 
or merchandising, concern? In recording statistics about farms, 
when is a farm not a farm but a truck garden? These few 
examples are probably sufficient to emphasize the point that the 
unit must be carefully defined and that the defined unit must be 
strictly followed and freely or even religiously disclosed to all 
who in the future use the statistics. 

Carefully planned schedules of questions, often called “ques¬ 
tionnaires,” are the principal means of gathering statistics. 
These vary from schedules simple enough for oral presentation, 
as frequently utilized in polls, to the elaborate forms used by the 
government or research organizations. In the first phase of 
statistical investigation, the gathering of facts, care in following 
all the rules of common sense and logical definition is epitomized 
in the formulation of the questionnaire, or schedule.

QUESTIONNAIRES OR SCHEDULES

Official Example of Care in Description of Units. In taking the
United States census, for example, the assurance of accuracy in 
regard to these important but detailed matters is guarded by a 
skillfully organized system. Forms are supplied, with column 
arrangement for writing in all the required information. A 
question appears at the head of each column, and the columns, 
and therefore the questions, are grouped into subjects; thus in the 
schedule for the 1940 population census there are 34 columns 
grouped under the subjects location, household data, name, rela¬ 
tion, personal description, education, place of birth, citizenship, 
residence, and employment status. In addition, columns 35 to 50 
contain supplementary questions to be asked only a sample of all 
the persons enumerated. Figures 1 to 3 show in three sections 
the 34 questions asked of all persons. Figure 4 is included to
show the questions on employment status in the 1930 census, 
revealing thereby the great elaboration of this type of question in 
the 1940 census. 

Sample forms that had been filled in with illustrative answers
were supplied to enumerators, and a complete, simple description 
of the manner in which the form was to be filled out was printed on 
the sample schedule. Pamphlets were issued for the use of 
enumerators, giving detailed instructions. For the 1940 census 
there was issued to enumerators taking the census of population
and agriculture a printed and indexed pamphlet 173 pages long. 
This gave detailed definitions and described procedure for enu¬ 
merators to follow under the various circumstances that might 
arise in their house-to-house canvasses. 

Moreover, the enumerators worked directly under experienced 
district supervisors, who, in turn, were under area managers 
responsible to the Bureau of the Census in Washington. To train 
the 529 district supervisors in the 1940 census taking and census 
procedures, a picked group of more than one hundred men from all 
parts of the country were given a special course of instructions in 
Washington. Those who passed the examination were sent out as 
area managers to the 104 census areas, each to direct the training 
of five to seven district supervisors and to act as regional manager
between them and the Bureau of the Census in Washington.

The 529 districts were broken into enumeration districts of 
which there were about 147,000. Generally speaking, there was
one enumerator in each of these districts, but in certain regions
one enumerator covered more than one district. Therefore,
about 123,000 enumerators were used. Wide publicity for such
careful preparation in the case of the 1940 census resulted from
Congressional protests about some of the new questions.¹

Fig. 1.—The first 12 questions on the 1940 census schedule grouped under topics location, household data,
name, relation, and personal description. Notice the sample entries.

Fig. 2.—Questions 13 to 20 on the 1940 census schedule grouped under topics education, place of birth,
citizenship, and residence.

Fig. 3.—Questions 21 to 34 on the 1940 census schedule grouped under topic employment status of person fourteen years old and
over, with subgrouping indicated in captions of columns. Notice the sample entries.

Fig. 4.—Questions in the 1930 census schedule on employment status under topics occupation, industry, and
employment. In the 1940 census schedule these questions were placed among the supplementary ones.

To illustrate the necessity for careful definition of units and 
description of procedure and to solve the census problem of the 
amazing family described above, the following is quoted from the
Instructions to Enumerators:²

Who Is to Be Enumerated in Your District 

300. The problem of who is to be enumerated in your district is 
extremely important. Therefore, study very carefully the following 
rules and instructions. 

301. The Census Day. There should be a return on the population
schedule for each person alive at the beginning of the census day, i.e.,
12:01 a.m. on Apr. 1, 1940. Thus, persons who died after 12:01 a.m.
should be enumerated; and infants born after 12:01 a.m. on Apr. 1, 1940,
should not be enumerated.

302. Usual Place of Residence. Enumerate every person at his
“usual place of residence.” This means, usually, the place that he
would name in reply to the question “Where do you live?” or the place
that he regards as his home. As a rule, it will be the place where the 
person usually sleeps. 


Persons to Be Enumerated in Your District 

304. Enumerate all men, women, and children (including infants) 
whose usual place of residence is in your district or who, if temporarily in 
your district, have no usual place of residence elsewhere. Persons who 
move into your district after Apr. 1, 1940, for permanent residence 
should be enumerated by you, unless you find that they have already 
been enumerated in the district from which they came. 

305. Residents Absent at Time of Enumeration. Some persons having
their usual place of residence in your district may be temporarily 
absent from the household at the time of the enumeration. These you 
must enumerate with the other members of the household, obtaining 
the information regarding them from their families, relatives, acquaint¬ 
ances, or other persons able to give it. However, do not include with 

¹ The New York Times, Feb. 27-29, Mar. 1-3, 1940.

² Bureau of the Census, Instructions to Enumerators, Population and
Agriculture, pp. 14-18, 80-81.

the household a son or daughter permanently located elsewhere or
regularly employed elsewhere and not sleeping at home. 

306. Persons to be counted as members of the household include the
following: 

a. Members of the household temporarily absent at the time of the 
enumeration, either in foreign countries or elsewhere in the United 
States, on business or visiting. 

b. Members of the household attending schools or colleges located in
other districts, except student nurses away from home and students in 
the Naval Academy at Annapolis or in the Military Academy at West 
Point or in any other training school or institution operated by the 
War or the Navy Departments or the United States Coast Guard. 

c. Members of the household who are in a hospital or a sanitarium 
but who are expected to return in a short period of time. 

d. Servants or other employees who live with the household or sleep 
in the same dwelling. 

e. Boarders or lodgers who sleep in the house. 

f. Members of the household enrolled in the Civilian Conservation
Corps. 

307. In the great majority of cases the names of absent members will 
not be given to you by the persons furnishing the information, unless 
particular attention is called to them. Before finishing the enumera¬ 
tion of a household, therefore, you should ask the question: Are there 
any members of the household who are absent? 


Persons Not to Be Enumerated in Your District 

313. There will be a certain number of persons present, and perhaps 
lodging and sleeping in your district at the time of the enumeration, who 
do not have their usual place of residence there. As a rule, do not 
enumerate as residents of your district any of the following classes, 
except as provided in paragraph 314: 

a. Persons temporarily visiting with the household. If, however, 
they do not have any usual place of residence from which they will be 
reported, they should be enumerated with the household. 

b. Households temporarily in your district which have a usual place
of residence elsewhere from which they will be reported. 

c. Transient boarders or lodgers who have some other usual or perma¬ 
nent place of residence, that is, who have a house or apartment else¬ 
where in which they usually reside and where they will be enumerated. 

d. Persons from abroad temporarily visiting or traveling in the United 
States and foreign persons employed in the diplomatic or consular service
of their country (see paragraph 331). (Enumerate other persons
from abroad who are students in this country or who are employed here, 
however, even though they do not expect to remain here permanently.) 

e. Students or children living or boarding with this household in 
order to attend some school, college, or other educational institution in 
the locality but who have a usual place of residence elsewhere from which 
they will be reported. 

/. Persons who take their meals with the household but usually lodge 
or sleep elsewhere. 

g. Servants or other persons employed by the household but not 
sleeping in the same dwelling.

h. Persons who were formerly members of this household but have 
since become inmates of a jail, or a mental institution, home for the
aged, infirm, or needy, reformatory, prison, or any other institution in 
which the inmates may remain for long periods of time. 

i. Transient patients of hospitals or sanitariums. Such patients are 
to be enumerated as residents in the households of which they are mem¬ 
bers and not as residents in the institution, unless they have no other 
place of residence at which they will be reported. 

314. When to Make Exceptions. In deciding when to make exceptions
to the rules indicated above, consider whether the household or
persons temporarily residing in your district will be reported at another 
place of residence by some person in a position to supply the information 
required. If the persons or household will not be so reported, enumerate 
them as residents of your district. 

Enumeration of Special Classes of Persons

315. You may experience some difficulty in determining whether to 
enumerate certain special classes of persons indicated below. In any 
instance in which you are not sure whether to include persons as resi¬ 
dents of your district, ask your squad leader or supervisor for further 
instructions. 

316. Servants. Enumerate with the household any servants, laborers, 
or other employees who live with the household and sleep in the same 
house or dwelling unit. However, enumerate servants who sleep in 
separate and completely detached dwellings as separate households, 
even though the dwelling is on land owned by members of the household 
by which the servants are employed.


318. Students at School or College. If there is a school, college, or 
other educational institution in your district that has students from 
outside your district, enumerate as residents of the school only those 
students who have no usual places of residence elsewhere. Especially
in a university or professional school there will be a considerable number
of the older students who are not members of any household located
elsewhere. Find and enumerate all such persons. 

319. Schoolteachers. Enumerate teachers in a school or college at 
the place where they live while engaged in teaching, even though they 
may spend the summer vacation at their parents' home or elsewhere. 

320. Student Nurses. Enumerate student nurses as residents of the 
hospital, nurses' home, or other place in which they live while they are 
receiving their training. 

321. Patients in Hospitals, Sanitariums, and Convalescent Homes. 
Most patients in hospitals, sanitariums, and convalescent homes are 
there temporarily and have some other usual place of residence. Enu¬ 
merate patients as residents of such an institution only if they have no 
other place of residence from which they will be reported. A list of 
persons having no permanent homes can usually be obtained from the 
institution records. 

322. Inmates of Prisons, Asylums, and Institutions Other than Hospi¬ 
tals. Your district may include a prison, reformatory, or jail, a home 
for orphans, for aged, infirm, or needy persons, for blind, deaf, or incura¬ 
ble persons, a soldiers' home, an asylum or hospital for the insane or the 
feeble-minded, or a similar institution in which the inmates usually 
remain for long periods of time. Enumerate all the inmates of such 
institutions at the institutions. Note that in the case of jails you must 
enumerate the prisoners there, however short the sentence. 


Census of Agriculture 
General Information 

Purpose of the Census of Agriculture. An act of Congress provides 
that a census of agriculture be taken every 5 years, for the purpose of 
obtaining basic information on farm acreage, land values, crops, live¬ 
stock, and other general items relating to agriculture. The Sixteenth 
Census, which will be taken as of Apr. 1, 1940, will include compre¬ 
hensive information on agriculture, including irrigation and drainage of 
farm land. 

Every enumerator must fill out a farm and ranch schedule for each 
tract of land in his enumeration district that might classify as a “farm”
under the census classification, giving all the requested information. 
The information should be obtained by a personal visit of the enumerator.
It is absolutely necessary that the census be complete and accurate.
Census data are widely used by both private and public agencies
and often form the basis for legislative and administrative programs.
The farmer should be made to feel that his contribution to the census 
is of real value to himself and to his community. 

Census Schedules Are Confidential. The Federal law providing for the 
census prescribes heavy penalties for revealing information to unauthorized
persons. The enumerator should make it clear, in dealing with
persons who seem unwilling to give the information requested, that he 
is not allowed to give any information to their neighbors or other 
persons; that only sworn census employees will have access to the farm 
schedules; and that those records for individual farms cannot be used
for purposes of taxation, regulation, or investigation. 


Definition of a Farm. The definition of a farm found on the face of 
the schedule must be carefully studied by the enumerator. Note that 
for tracts of land of 3 acres or more the $250 limitation for value of 
agricultural products does not apply. Such tracts, however, must have 
had some agricultural operations performed in 1939 or contemplated in 
1940. A schedule must be prepared for each farm, ranch, or other 
establishment that meets the requirements set up in the definition. A 
schedule must be filled out for all tracts of land on which some agri¬ 
cultural operations were performed in 1939 or are contemplated in 1940 
and which might possibly meet the minimum requirement of a “farm.”
When in doubt, always make out a schedule. 


You now have instructions that will help enumerate the interesting
family first encountered above: the mother will be enumerated
with the household (paragraph 321), the baby will not be enumerated
(paragraph 301), the son will be enumerated (paragraph 318), the
daughter will not be enumerated (paragraph 319), and the father
will not be enumerated at the household, although if the jail is in
town he will there be enumerated (paragraph 322).
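The reasoning just applied to this family can be written out as a small lookup. The following Python sketch is illustrative only: the situation labels, the `Person` class, and the function name are hypothetical and are not part of the Instructions to Enumerators; only the paragraph numbers and the decisions are taken from the text above.

```python
from dataclasses import dataclass

# Hypothetical situation labels. Each maps to a pair:
# (counted with the household?, paragraph cited in the text).
RULES = {
    "temporary_hospital_patient": (True, 321),
    "student_away_at_college": (True, 318),
    "teacher_living_where_she_teaches": (False, 319),
    "inmate_of_jail": (False, 322),
}


@dataclass
class Person:
    label: str
    situation: str
    # Paragraph 301: a return is made only for persons alive at
    # 12:01 a.m. on Apr. 1, 1940.
    alive_at_census_moment: bool = True


def counted_with_household(person):
    """Return (decision, paragraph) for one member of the household."""
    if not person.alive_at_census_moment:
        return (False, 301)
    return RULES[person.situation]


family = [
    Person("mother", "temporary_hospital_patient"),
    # Born at 10 a.m. on the census day, hence not alive at 12:01 a.m.
    Person("baby", "born_after_census_moment", alive_at_census_moment=False),
    Person("son", "student_away_at_college"),
    Person("daughter", "teacher_living_where_she_teaches"),
    Person("father", "inmate_of_jail"),
]

decisions = {p.label: counted_with_household(p) for p in family}

for label, (counted, paragraph) in decisions.items():
    verdict = "enumerate with household" if counted else "do not enumerate with household"
    print(f"{label}: {verdict} (paragraph {paragraph})")
```

A table of this kind is only as good as its definitions; the real instructions provide for exceptions (paragraph 314) that a fixed lookup does not capture.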

Figures 5 and 6 are photographic reproductions of parts of the 
Farm and Ranch Schedule used for the census of agriculture. On 
Fig. 6 appears the definition of a farm to which reference is made 
in the general information quoted above from the manual of 
instructions. Altogether the farm and ranch schedule contains 
232 questions on 16 subjects. The subjects include information 
about the operator, farm acreage, values, farm mortgage and 
taxes, irrigation, cooperative selling and purchasing, farm labor, 
farm expenditures in 1939, farm machinery and facilities, livestock
and livestock products, crops harvested on the farm in 1939,
and value of products used and of forest products sold in 1939.

Fig. 5. A portion of the Farm and Ranch Schedule to illustrate the type of questions included.

Only trained enumerators can successfully use such elaborate 
questionnaires as those illustrated; only when properly instructed 
can enumerators know how to get the information requested in 
each question of these complex schedules. In some cases, the 
questionnaires, or schedules of questions, are tried out by a 
person-to-person call at the sources of information in advance of 
collecting the data for the final enumeration. There was during 
the summer of 1939 a trial census, covering a sample area in 
Indiana, taken by the United States Bureau of the Census while 
formulating the new and more complicated 1940 census schedule. 

Statistics Obtained from Samples. Every twentieth person 
enumerated on the 1940 census was asked supplementary ques¬ 
tions. The results constitute a 5 per cent sample. For the 
sample of population, the following subjects were covered: the 
usual occupation, industry, and worker class as a supplement to 
information obtained concerning present occupation, in order to 
determine the availability of and shifts to various kinds of labor; 
whether the respondent has a Federal social-security account 
number and whether wage deductions have been made for Federal 
old-age insurance during the 12 months ending Dec. 31, 1939; 
data showing the number of children born to women who are or
have been married (women married, widowed, or divorced), to 
make studies of differential fertility; mother tongue, or native 
language, obtained by a question asking what language was 
spoken in the home in earliest childhood; the status of veterans of 
foreign wars and their wives, widows, and children; and informa¬ 
tion concerning the place of birth of the father and the mother of 
the respondents. 
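Taking every twentieth person from a list is what is now called a systematic sample. The Python sketch below shows only the arithmetic; the list of serial numbers and the function name are hypothetical, and it assumes persons are simply numbered in order of enumeration, which is a simplification of the actual schedule.

```python
def every_kth(records, k=20, start=0):
    """Return every k-th record, beginning at 0-based index `start`."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return records[start::k]


# Hypothetical serial numbers for 1,000 enumerated persons.
population = list(range(1, 1001))

# Begin at index 19 so that persons 20, 40, 60, ... are selected.
sample = every_kth(population, k=20, start=19)

fraction = len(sample) / len(population)
print(len(sample), "persons selected; sampling fraction =", fraction)
```

With 1,000 persons and an interval of 20, the sample contains 50 persons, that is, 5 per cent of those enumerated.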

This is the first decennial census in which the sampling process 
has been applied, and the results of the experiment are eagerly 
awaited by statisticians everywhere. While the decennial census 
has always been presumably a complete enumeration, other gov¬ 
ernmental statistics have frequently been drawn from samples. 
Indeed, because of limited funds, it is necessary for the Bureau of 
Labor Statistics to resort to the sampling method to obtain data
on wages and hours in industry. Preliminary to the collection of 
such data the census data for the industry are studied to determine
in which states the industry is of material importance.

Fig. 7.—First portion of questionnaire used by the Bureau of Labor Statistics to obtain data on
union wages and hours.

Manufacturers’ directories are examined and books and periodicals
relating to the industry are read, thus obtaining the essential
a priori background of knowledge to form the basis for sound
proportional sampling.¹
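One common reading of proportional sampling is that each state receives a share of the sample equal to its share of the industry. The Python sketch below illustrates that reading only. It is not the Bureau's documented procedure; the state names, worker counts, and function name are hypothetical, and the largest-remainder rounding is an assumption made so that the parts add up to the whole.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample among strata in proportion to their sizes.

    Largest-remainder rounding is used so the allocations sum exactly
    to total_sample.
    """
    total = sum(stratum_sizes.values())
    quotas = {name: total_sample * size / total
              for name, size in stratum_sizes.items()}
    allocation = {name: int(quota) for name, quota in quotas.items()}
    leftover = total_sample - sum(allocation.values())
    # Hand out any remaining units to the largest fractional parts.
    by_remainder = sorted(quotas,
                          key=lambda name: quotas[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical numbers of workers in one industry, by state.
workers = {"State A": 60000, "State B": 30000, "State C": 10000}

alloc = proportional_allocation(workers, 200)
print(alloc)
```

Here a state holding 60 per cent of the workers receives 60 per cent of the 200 schedules, and so on for the others.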

To obtain the data on wages and hours of labor, the Bureau of 
Labor Statistics uses carefully prepared and elaborate questionnaires,
one of which is illustrated in Figs. 7 and 8. Trained
agents obtain the information from a responsible official of each 
local union. Each scale of wages and hours is verified by the 
union official interviewed and is further checked by comparison 
with the written agreements when copies are available. For 
example, in the building-trades survey for June 1, 1939, inter¬ 
views were obtained with 1,551 union representatives, and 2,729 
quotations of scales were received. The union membership 
covered by these contractual scales of wages and hours was 
approximately 444,000. Great care is exercised to see that the 
agents are adequately trained to collect the data; written instructions 
are supplied them by the Bureau of Labor Statistics, 
in which they are cautioned as follows:² 

In the final analysis the accuracy and value of the entire survey must 
rest upon the agents who collect the data. These data must be abso¬ 
lutely correct and so presented on the schedule as not to be confusing 
or ambiguous. Each agent is, therefore, requested to study thoroughly 
the instructions, not once but repeatedly, and to question any point 
therein which may not be perfectly clear. It is extremely important 
that the agent check every schedule carefully before mailing to the office 
to be sure that each item is correctly entered and explained. When 
agreements accompany the schedules, the agent must compare each 
quotation with the provisions in the agreement and must explain any 
differences. 

In order to ensure the collection of comparable data from all 
agents, the instructions give painstaking definitions of “union 
scale,” “collective agreement,” “apprentices” and “foremen,” 

¹ For further details of the methods employed by the Bureau of Labor 
Statistics, see “Methods of Procuring and Computing Statistical Information 
of the Bureau of Labor Statistics (1923),” Bulletin 326; also “Union 
Scale of Wages and Hours in the Building Trades, June 1, 1939,” Serial 
R1034, from Monthly Labor Review, November, 1939. 

² Bureau of Labor Statistics, “Instructions for Survey of Union Scales of 
Wages and Hours, 1939” (No. 7468), p. 1. 

13. Have there been any changes in rates since June 1, 1939, or have any changes been agreed upon to become 
effective in the near future? 

Date ______ Occupation ______ Rate ______ 

14. Agreements are negotiated with: 

(a) Individual employers ______ (b) Employers' association(s) ______ 

If (b), what proportion of union firms participate or are represented in the association(s)? ______ 

For Building Trades Only 

15. Does union cooperate with employer in establishing or enforcing standards of fair competition? 
Yes □ No □ If there is a written code of fair competition please attach a copy. If oral or customary 
arrangements, please explain. 

16. What proportion of the 1- and 2-family dwellings in this community are being built under union conditions 
so far as this trade is concerned? ______ 

Does the union have a special rate for this type of work? Yes □ No □ 
Explain. 

Fig. 8. Second portion of questionnaire used by the Bureau of Labor Statistics 
to obtain data on union wages and hours. 

[Questionnaire section heading legible in the figure: VIII. MONEY EARNINGS OF FAMILY FROM EMPLOYMENT OR BUSINESS OUTSIDE OF HOME OR AT HOME] 
Fig. 9. Questionnaire used for consumers' purchases study. 

“union rates” and “actual rates,” “union rates” and “prevailing 
rates,” and “averages.” 

Study of Family Income and Expenditures. In 1929 the Social 
Science Research Council suggested the advantages of conducting 
a study of consumption in such a way that the sample would cover 
a wide range of incomes, all types of natural families, and all 
occupations within representative communities of different 
sizes. Income data and certain other facts would be collected 
from all families visited, through the use of a short schedule. 
These data would provide the basis for selection of an adequate 
number of families in each income class to furnish more careful 
estimates of income and the details of expenditures. Following 
these suggestions, the National Resources Committee and the 
Bureau of Home Economics of the United States Department of 
Agriculture completed in 1939 a study of family income and 
expenditures. Figure 9 shows the questionnaire used.¹ Tables 
of data based upon this questionnaire are shown in Chap. IV. 
It may be noticed that the type of question and indeed the whole 
schedule are much less complex, involving much simpler units, 
than any thus far illustrated. It was necessary for this schedule 
to be simpler than those discussed above because for the con¬ 
sumer-income study the agents were not so well trained as are, for 
example, the regularly employed field agents of the Bureau of 
Labor Statistics. 
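
The two-stage plan described above (a short schedule for every family visited, then a detailed schedule for an adequate number of families in each income class) can be illustrated with a minimal sketch. The income-class boundaries, the quota of two families per class, and the sample incomes are all hypothetical and are not taken from the study itself.

```python
import random
from collections import defaultdict


def income_class(income):
    """Assign a family to an income class (boundaries are hypothetical)."""
    if income < 1000:
        return "under $1,000"
    if income < 2500:
        return "$1,000-$2,499"
    return "$2,500 and over"


def select_for_detailed_schedule(screened, quota, seed=0):
    """Second stage: draw up to `quota` families at random from each income class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for family in screened:
        by_class[income_class(family["income"])].append(family)
    return {
        cls: rng.sample(members, min(quota, len(members)))
        for cls, members in by_class.items()
    }


# First stage: every family visited reports income on the short schedule.
incomes = [400, 800, 950, 1200, 1800, 2400, 2600, 5000, 700, 1500]
screened = [{"id": i, "income": income} for i, income in enumerate(incomes)]

# Second stage: families chosen to receive the detailed expenditure schedule.
detailed = select_for_detailed_schedule(screened, quota=2)
for cls, families in detailed.items():
    print(cls, [family["id"] for family in families])
```

Fixing the quota per class, rather than sampling all families at one rate, is what keeps the sparsely populated income classes from being too thinly represented in the detailed tables.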

Mailed Questionnaires. In some cases, especially where the 
schedule of questions is comparatively simple, questionnaires 
are sent through the mail to the sources of information. Such a 
method may be used either where the units involved are very 
simple or where those who are filling out the questionnaires are 
known to be qualified to do so. The United States Bureau of the 
Census and the Bureau of Labor Statistics have been able to 
use this method to obtain certain types of information from 
manufacturing concerns regarding employment, pay rolls, manu¬ 
facturing output, labor turnover, and the like. The method 
appears to be most used where fairly simple facts are collected at 
regular intervals. Data on pay rolls and employment are 

¹ Bureau of Home Economics, U.S. Department of Agriculture, “Consumer 
Purchases Study,” Part I, Family Income, Miscellaneous Publication 
339, pp. 338-339; cf. National Resources Committee, Consumer Incomes in 
the United States (1938), p. 49. 

obtained by mailed questionnaires monthly by the Bureau of 
Labor Statistics from representative manufacturing establishments 
in 90 manufacturing industries.¹ Figure 10 is an illustration 
of the type of letter used by such agencies to secure the good 
will and cooperation of businessmen.² 

Where the questionnaire-by-mail method is used, the returns 
must be carefully edited and subsequent correspondence is 
frequently required to correct mistakes made on the returns. 
Manufacturing and merchandising concerns in this country have 
become trained in the matter of filling out questionnaires for the 
government through years of practice so that there has been built 
up a cooperative enterprise between the government and business 
in the gathering of business statistics. Although sometimes feel¬ 
ing the heavy burden of filling out numerous forms of this type, 
business is nevertheless glad to cooperate because it is eager to 
see each month the compilation of business data that emanates 
from government sources. 

Income-tax returns are of the nature of questionnaires and 
are a source of many important statistics. Everyone is familiar 
with the care necessary in the examination of the units involved; 
everyone who has had to handle a return or listen to the head of 
the family talk about it knows how detailed and specific are the 
printed instructions accompanying each form on which the return 
is made. In the case of the income-tax return, which frequently 
becomes so complicated as to require legal advice and expert 
accountants, the penalty for failure to file a return is sufficient 
to supply any incentive needed to overcome all obstacles. For 
failure to supply information for the other types of questionnaire 
that have been discussed, with the exception of the census, there 
is no similar penalty; the business concerns fill out such questionnaires 
in a spirit of public service and to obtain the resulting 
compilations of data. 

Rules for Constructing Questionnaires. Any investigator who 
is tempted to seek information by the questionnaire method 
will be well advised to spend considerable effort first, to make 
certain that the facts are not already available, and then to 

¹ Bureau of Labor Statistics, “Employment and Pay Rolls,” Serial R1052, 
November, 1939, pp. 7, 11, and 16. 

² This letter was used in January, 1940, with a new questionnaire revised 
to obtain better monthly data on labor turnover. 

U.S. DEPARTMENT OF LABOR 
BUREAU OF LABOR STATISTICS 

WASHINGTON 


Dear Sir: 

In response to numerous requests for more detailed information 
regarding labor turn-over in manufacturing industries, and in order 
that the monthly reports published may become of greater value to you 
and others who cooperate with us, the Bureau of Labor Statistics has 
revised the form used in collecting turn-over data. 

I am sure you will agree that one of the most important revisions 
of this form is the separation of accessions into two groups: 
the first, to show the number of workers rehired after a separation of 
three months or less; the second, to include all other workers hired. 
The purpose of this breakdown, of course, is to determine whether or 
not accessions occur in connection with new job opportunities or 
whether they are the result of temporary suspensions. 

We have also provided space to report separately changes in 
clerical, sales and supervisory personnel, so as to permit tabulations 
for the turn-over of these employees. If it is difficult or impossible 
for you to report this information separately, we shall appreciate the 
data either for the total of all employees or, if this too is not 
feasible, for plant employees only. 

We are enclosing copies of the revised forms for January. I 
want to assure you that the data which you furnish will be kept strictly 
confidential and will be used in such a way as not to reveal the identity 
of any individual firm. 

We sincerely hope that our labor turn-over reports based on 
this new procedure will be more useful and valuable to you, and we 
shall greatly appreciate your continued cooperation. 

Very truly yours, 

Isador Lubin, 

Commissioner of Labor Statistics 


Enclosures 

(8678) 


Fig. 10. A typical letter from the Bureau of Labor Statistics seeking to secure 
the good will and cooperation of businessmen in the reporting of statistics. 

investigate well the pitfalls of questionnaire making, which is 
a highly specialized art. There are six fundamental but simple 
rules to be followed: 

1. The interest of the recipients of the questionnaires must be 
aroused or their cooperation obtained through some means. 
This may be done by engaging the support of some organization 
with which the individual informants are associated. For exam¬ 
ple, if the questionnaire is to go to bankers, the support or endorse¬ 
ment of the American Bankers Association should be enlisted. 
Interest in the questionnaire may also be aroused by the promise 
to furnish free copies of the summarized information when com¬ 
piled. In this manner and by the promise of secrecy regarding 
individual returns, various governmental units obtain great 
quantities of statistical information. 

2. The questionnaire should be as short as possible, consistent 
with the scope of information sought; and the individual ques¬ 
tions should be so formulated as to be free of all ambiguity. 
They should be simple. Avoid presenting “problems” that will 
puzzle the recipients of questionnaires. 

3. Where possible, arrange the individual questions so that 
replies can be brief and unequivocal. Yes or no or perhaps 
merely a check mark is the ideal answer. 

4. The letter transmitting the questionnaire should be 
brief and dignified and yet should “sell” the idea to the 
informants. 

5. After all is prepared, try out the questionnaire along with 
the transmittal letter on a dozen or so of the potential question¬ 
naire recipients in order to make final revisions before printing 
the questionnaires, or schedules. 

6. Always include a self-addressed stamped return envelope. 

The first five rules apply whether the questionnaire is to be 
used by trained enumerators or to be sent by mail, but special 
care must be exercised if sent by mail. Study of Fig. 9 will 
reveal that answers to all questions are quite simple, in some 
cases merely a check mark (see questions VI, 1, 2, 4, and VII), 
in other cases the entry of a familiar numerical item. Less 
highly trained enumerators are required for handling such a 
questionnaire than are required for handling the United States 
census schedules. 

EDITING 

When the questionnaire is received from the agent or from the 
respondent by mail, it must be examined. If any statement on 
the schedule conflicts with other statements or if the schedule 
is incomplete or lacks clearness, it may have to be returned to 
the agent or respondent for explanation or revision. This is 
called “editing” the returns or the schedules. In any case, a 
certain amount of editing must always be done before tabulation 
of the data is begun. When trained visiting enumerators have 
been used in the survey, there will, of course, be a minimum 
of mistakes. When the questionnaires have been filled out by 
the informant directly, it may be necessary to write for further 
information or for corrections because of inadvertent mistakes 
in replies. If the respondents have been interested sufficiently 
to return the questionnaire with answers filled in, they will 
probably be willing to answer further simple questions to eluci¬ 
date their former replies. If it is believed that the information 
has been deliberately falsified or withheld, it may be necessary 
to discard the entire schedule or at least the replies in it that 
seem to be of doubtful truth. 

Editing the schedules is the process of preparing the original 
statements in the schedule for classification, coding, and tabula¬ 
tion. Careful editing is necessary in order to obtain compilations 
of data that will truly reflect the conditions being investigated. 
One task of editing is to see that all figures entered on the return 
are clear. If not, the editor rewrites the figures. If they are so poorly 
written that even the editor cannot read them, the schedule 
must be abandoned or the information obtained by further 
correspondence. If the editing is done locally, many of these 
difficulties may be eliminated by telephoning. 

The principal task in editing is to locate all incomplete, incon¬ 
sistent, or improbable and impossible answers. When such 
answers are found, it is necessary either to discard the defective 
schedules or to obtain correct replies through further inquiry. 
This does not, of course, imply the elimination of “unexpected” 
replies. An answer is incomplete if, for example, pneumonia 
is given as the cause of death; it is necessary to know 
whether it is bronchial or lobar pneumonia. An answer is 
inconsistent if, for example, a return shows a person widowed 
when from his age it is clear that he never could have been 
married. If a male is reported as having died of a 
disease that is known to occur only in females, the answer is 
impossible. The line between improbable and simply unexpected 
replies is somewhat less distinct. 
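
The three kinds of defective answer just described can be expressed as simple checks applied to each schedule. The following is only a sketch: the field names, the minimum age for marriage, and the list of causes of death confined to females are hypothetical stand-ins chosen for illustration, not rules drawn from any actual editing manual.

```python
# Hypothetical editing rules, for illustration only.
MIN_MARRIAGE_AGE = 12
FEMALE_ONLY_CAUSES = {"puerperal septicemia"}


def edit_schedule(record):
    """Return a list of (kind, message) problems found on one schedule."""
    problems = []

    # Incomplete: the kind of pneumonia must be stated.
    if record.get("cause_of_death") == "pneumonia":
        problems.append(("incomplete", "state whether bronchial or lobar pneumonia"))

    # Inconsistent: too young ever to have been married, yet reported widowed.
    age = record.get("age")
    if record.get("marital_status") == "widowed" and age is not None and age < MIN_MARRIAGE_AGE:
        problems.append(("inconsistent", "widowed at an age below marriageable age"))

    # Impossible: a male reported as dying of a disease occurring only in females.
    if record.get("sex") == "male" and record.get("cause_of_death") in FEMALE_ONLY_CAUSES:
        problems.append(("impossible", "cause of death occurs only in females"))

    return problems


doubtful = {"sex": "male", "age": 8, "marital_status": "widowed", "cause_of_death": "pneumonia"}
clean = {"sex": "female", "age": 40, "marital_status": "married", "cause_of_death": "lobar pneumonia"}

for kind, message in edit_schedule(doubtful):
    print(kind, "-", message)
print(edit_schedule(clean))
```

A schedule that returns an empty list is not thereby proved correct; the checks only flag replies that must be completed or corrected by further inquiry, and merely unexpected replies pass through untouched.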

Only after incomplete, inconsistent, or improbable and impos¬ 
sible replies have been completed or corrected and all unclear 
figures carefully clarified are the schedules ready for coding, 
classification, and tabulation. For elaborate undertakings like 
the census, instructions are printed not only for the guidance 
of the enumerators but also for the editing and coding of the 
returns. For example, it is pointed out that the examination 
for completeness and consistency should be made family by 
family and not line by line. It will be easier to follow the entries 
belonging to the family if a strip of cardboard is placed across 
the schedule just under the line containing the entries for the 
last member of the family.¹ The coding and editing instructions 
say that all corrections and code figures entered on the schedule 
by the coding clerks should be made with red ink and a medium-point 
pen (neither a stub nor an extremely fine pen). Such a 
detailed instruction as this is necessary in order to secure uniformity 
and, when tabulation is undertaken, will enormously 
facilitate the work of the card-punching operators. 

CODING 

Whether or not machine tabulation is used, the coding of the 
schedules is a measure for economizing time. When large 
amounts of data are involved, consistent classification is enor¬ 
mously simplified by the use of code numbers. In arranging 
data it is then necessary only to observe a code number conspicuously 
and uniformly placed on the return instead of reading 
a title and remembering to what class that title belongs. On 
a Works Progress Administration project, to construct indexes of 
manufacturing employment and pay rolls in the state of New 
Jersey, 1923-1940, it was not possible to obtain the use of tabu¬ 
lating machines. It was found necessary, nevertheless, to use a 
carefully worked out coding procedure to avoid hopeless con¬ 
fusion in the handling of the data, which came monthly from 

¹ Cf. United States Bureau of the Census, Instruction Manuals on Coding, 
passim. 

several hundred reporting firms. When machine tabulation is 
used, the coding procedure is a necessary step; it will be noticed 
that on the schedules (see Figs. 1 to 8) columns are inserted for 
the code numbers or letters to represent the various types of 
information on the schedule. 

An Illustration of Coding. In the 1939 census of manufac¬ 
tures, the manufacturing industries in the United States were 
grouped into 20 groups, each with a number. Food and kindred 
products constitute group 1; its code number is 100. Lumber 
and timber basic products form group 5; its code number is 500. 
Chemicals and allied products are group 9; its code number is 900. 
All subgroups of industries in the food and kindred products 
classification have code numbers in the 100's; for example, 
beverages are numbered in the 180's—nonalcoholic beverages 
is 181; malt liquors 182, wines 184, and so on. Grain-mill 
products are numbered in the 140's—flour and other grain-mill 
products is 141, cereal preparations 143, rice cleaning and 
polishing 144, and so on. Confectionery and related products 
are numbered in the 170's—chocolate and cocoa products is 172, 
chewing gum 173, and so on. Similarly, subgroups of industries 
in the chemicals and allied products classification have code 
numbers in the 900's; for example, industrial-chemical industries 
are numbered in the 980's—plastic materials is 982, explosives 
9[illegible], coal-tar products, crude and intermediate, 981, and so on.¹ 
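
The advantage of such a numbering plan is that the group and subgroup can be read directly from the digits of the code. A minimal sketch follows, using only the three-digit codes quoted above; the function names are hypothetical, and the sketch makes no assumption about how groups 10 to 20 are numbered.

```python
# Group titles and industry codes quoted in the text (1939 census of manufactures).
GROUPS = {
    100: "Food and kindred products",
    500: "Lumber and timber basic products",
    900: "Chemicals and allied products",
}

INDUSTRIES = {
    141: "Flour and other grain-mill products",
    143: "Cereal preparations",
    144: "Rice cleaning and polishing",
    172: "Chocolate and cocoa products",
    173: "Chewing gum",
    181: "Nonalcoholic beverages",
    182: "Malt liquors",
    184: "Wines",
    981: "Coal-tar products, crude and intermediate",
    982: "Plastic materials",
}


def group_code(industry_code):
    """Hundreds digit gives the group: 182 -> 100."""
    return (industry_code // 100) * 100


def subgroup_code(industry_code):
    """Hundreds and tens digits give the subgroup: 182 -> 180."""
    return (industry_code // 10) * 10


for code in (182, 143, 982):
    print(code, INDUSTRIES[code], "|", subgroup_code(code), "|", GROUPS[group_code(code)])
```

A clerk sorting returns by hand performs the same operation by eye: the first digit places the return in its group without any need to read or interpret the industry title.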

The classifications adopted by the United States Bureau of the 
Census for the 1939 census of manufactures follow closely the 
suggestions made by the Technical Subcommittee on Industrial 
Classification composed of representatives of various government 
agencies.² The suggested classification of this subcommittee, 
designated the Standard Industrial Classification Code, 
was made according to the following principles:³ 

1. The classification should conform to the existing structure 
of American industry. 

¹ United States Bureau of the Census, “Industry Classifications for the 
Census of Manufactures, 1939,” Form 75. 

² Members of the subcommittee included representatives of the Department 
of Labor and Industry of New York State, the Federal Social Security 
Board, the Bureau of Internal Revenue, the Bureau of Labor Statistics, the 
Bureau of the Census, the United States Employment Service, and the 
Central Statistical Board. 

³ Central Statistical Board, May 10, 1938. 

2. The reporting units to be classified are establishments. 
(An establishment is defined as a place of business. All persons 
working at the same location or place of business are classified 
in the same industry.) 

3. Each establishment is to be classified according to its 
major activity. 

4. Each industry group established must have significance 
from the standpoint of the number of establishments and 
employees involved, volume of business, employment and pay¬ 
roll fluctuations, and other important economic features. 

TABULATION 

When the schedules have been edited and coded they are 
ready for the operations of the card-punch machines, and the 
final machine tabulations are made from these punched cards. 
The information on each schedule is transferred in code to the 
punch cards. With a machine resembling a toy typewriter, 
operators punch holes or combinations of holes in the cards so 
that the electrically operated machinery for sorting and tabulating 
can automatically transfer the information to totals by any 
classification desired. The punch card somewhat resembles 
the music roll of an old-time player piano, and most of the 
operations through which it goes are mechanical and electrical. 
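
What the sorting and tabulating machinery accomplishes, namely the accumulation of totals under any classification desired, can be imitated in a few lines. In this sketch each "card" is a small record; the field names and the figures on the cards are invented for illustration.

```python
from collections import defaultdict

# Each record stands for one punched card (all values are hypothetical).
cards = [
    {"industry": 181, "state": "NJ", "employees": 40},
    {"industry": 182, "state": "NJ", "employees": 120},
    {"industry": 141, "state": "NY", "employees": 75},
    {"industry": 982, "state": "NJ", "employees": 60},
    {"industry": 184, "state": "NY", "employees": 15},
]


def tabulate(cards, classify, field):
    """Total `field` over all cards, grouped by whatever `classify` returns."""
    totals = defaultdict(int)
    for card in cards:
        totals[classify(card)] += card[field]
    return dict(totals)


# The same deck of cards yields totals under different classifications.
by_state = tabulate(cards, lambda card: card["state"], "employees")
by_group = tabulate(cards, lambda card: (card["industry"] // 100) * 100, "employees")

print(by_state)
print(by_group)
```

The point of the punched card is the same as that of the record here: once the information is in coded form, it can be re-sorted and re-totaled under a new classification without returning to the original schedules.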

The 1930 census required the punching of 326,635,219 cards, 
which required an additional handling for verification. These 
cards represented 2,000,000 pounds of paper and would make a 
belt reaching nearly twice around the world at the equator. 
Punching, tabulating, and related work were equivalent to the 
handling of 4,701,671,697 cards once. 
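
The figures quoted above permit two simple ratios to be worked out: the average number of times each card was handled, and the number of cards to the pound of paper. The short calculation below uses only the three figures given in the text; the variable names are merely illustrative.

```python
cards_punched = 326_635_219          # cards punched for the 1930 census
equivalent_handlings = 4_701_671_697 # all work expressed as single handlings
paper_pounds = 2_000_000             # weight of the cards

# Average number of times each card passed through some operation.
handlings_per_card = equivalent_handlings / cards_punched

# Number of cards in one pound of card stock.
cards_per_pound = cards_punched / paper_pounds

print(f"handlings per card: {handlings_per_card:.1f}")
print(f"cards per pound:    {cards_per_pound:.0f}")
```

On these figures each card was handled, on the average, between fourteen and fifteen times, which indicates how much of the labor of a census lies in sorting, tabulating, and verifying rather than in the original punching.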

The Bureau of the Census has its own unit tabulating equipment. 
Some of these machines can digest 400 cards a minute. 
The unit machines were invented and developed within the 
Bureau by Herman Hollerith, who was employed in the Bureau 
and invented the first machine to tabulate the 1890 census. 
He is now known as the “father of machine tabulation,” a method used 
throughout the world by governments and business to handle 
large statistical jobs. 



CHAPTER III 


SOURCES OF STATISTICS 

Primary and Secondary Sources. The original collector of 
data is their primary source. Generally speaking, data obtained 
from a primary source inspire greater confidence than the same 
data taken from a secondary source. The primary source is 
presumably the one sure place to find the exact definition of the 
units of observation involved. Subsequent reproductions of 
the data may fail to reproduce this essential information and 
lead to a misunderstanding of the true meaning of the data. 

The United States Bureau of the Census is the primary source 
of population data, of census data in general, and of all the 
statistical data published by the United States Department of 
Commerce, for the Bureau of the Census is the data-gathering 
agency of the Department. The Bureau of Foreign and Domestic 
Commerce, on the other hand, is a large retailer of statistical 
data gathered not only from the records of the Bureau of the 
Census but also from numerous other governmental and non¬ 
governmental sources. While governmental publications are 
thus not uniformly primary sources, they are usually very 
careful to give exact reference to the primary sources and to 
define units adequately. 

In some cases, secondary sources may be better than the 
primary. Such is the case when experts presumably better 
qualified than the general run of statistical researchers have 
selected the good statistics from the poor ones in some primary 
source that may be either obscure and difficult to obtain or of a 
highly technical nature. Occasionally a secondary source 
performs the valuable function of selecting data impartially 
from primary sources that are biased in one way or another. 
Sometimes it is necessary also to be on guard against bias in 
government sources.¹ 

¹ Hinrichs, A. F., “Statistical Bias in Primary Data and Public Policy,” 
Journal of the American Statistical Association, Vol. 33 (1938), pp. 143-152. 

Natural Sciences. After the development of the statistical 
theories of gases (Charles’s law, Boyle’s law, Avogadro’s law, 
the work of Gay-Lussac, and the like) the physical sciences and 
arts accumulated source materials of a statistical character. 
Beginning with the last quarter of the nineteenth century, 
biology and zoology also accumulated source materials of a 
statistical character when a group of English biologists concluded 
that mass observation was necessary for the successful solution 
of their problems.¹ 

Nongovernmental Sources. Statistical data of the natural 
sciences consist to a large extent of hypothetico-observational 
or experimental data. The principal sources of these data are 
handbooks of the special fields of study and monographs written 
by scholars at the great centers of research. For example, 
sources of astronomical data are the observatories located in 
various places throughout the world. The sources of currently 
discovered data in biology, physics, and chemistry are the 
laboratories maintained by universities, by private business 
enterprisers, or by such institutions as the Smithsonian Institu¬ 
tion at Washington, D.C. 

Additional primary sources of statistics in the natural sciences 
are the several hundred technical journals, publications of the 
learned societies, trade journals, publications of commercial 
research organizations, college bulletins, and the publications of 
endowed research enterprises. Fortunately for those who desire 
to make use of them, the data currently accumulated in such 
sources are summarized or abstracted in publications that maintain 
sections of their respective issues for the purpose.² 

Statistical data for the natural sciences are also found in 
handbooks for the numerous special fields of study. For example, 
there are handbooks in medical entomology, physical therapy, 
geology, botany, experimental physics, and geophysics.³ 

¹ Anderson, O. N., “Statistical Method,” Encyclopaedia of the Social 
Sciences. 

² A partial list of such abstracting agencies is as follows: Science Abstracts, 
Abstracts of Geology, Abstracts of Bacteriology, Abstracts of Chemical 
Papers, Zentralblatt für Mathematik, Jahrbuch über die Fortschritte der 
Mathematik, Physikalische Berichte, and Biometrika. 

³ Handbook of Physical Therapy (1939); Handbuch der allgemeinen Chemie, 
unter Mitwirkung vieler Fachleute (1918-1937); Handbuch der Experimentalphysik 
(1929-1935), 43 vols.; Handbook for Chemistry and Physics. 

Governmental Sources. Sources of data in the natural sciences 
are enormously supplemented by governmental agencies. The 
government weather bureau supplies current and historical 
data important to many kinds of research in such natural sciences 
as botany, zoology, and geology. The Minerals Yearbook, 
published by the United States Department of the Interior, is a 
source of data for natural scientists. The Geological Survey 
is a source not only of geological data but of data on electrical 
power production and other information useful to engineers. 
Engineers also find that government agencies are sources of 
statistics on railroads, flood control, roads, and other similar 
subjects having to do with construction. 

Biologists find the chief source of modern vitality statistics 
of all sorts among the publications of governmental agencies. 
An important source of statistical data for medical men results 
from medical research recorded in the files of hospitals, some of 
which are governmentally operated. 

The quantity of statistical data relating directly to the natural 
sciences is thus large, but the natural sciences in addition make 
extensive use of the highly organized mass of statistical data 
collected largely by social scientists. Scholars in the natural 
sciences frequently make use of statistics concerning social and 
economic events. It is not at all uncommon for data concerning 
the behavior of human beings to enter into the calculations of 
engineers, physicists, and chemists engaged in practical business 
enterprise or pure research. Some illustrations were given in 
Chap. I. 

Social Sciences. Genesis of Statistical Sources. The increasing 
complexity of economic and social life has furnished the motive 
for the systematic marshaling of statistical data about human 
society; and, in addition, the dynamic quality of modern life 
makes it necessary to repeat statistical enumeration frequently in 
order to have knowledge of current facts and, what may be 
more important, knowledge of change. In the static conditions 
of earlier times one public fact-gathering enterprise could serve 
for years as a basis for judgments and for political and social 
action. Under modern dynamic conditions, this is not the case. 

In a democracy the timing of governmental action is dependent 
on the consent of the people, and that requires widespread 
knowledge of many economic and social facts and their interpretation. 
If democracy is to preserve its high standards of 
achievement, its powers of expression in the face of tremendous 
forces that appeal to sentiment rather than to reasoned judgment, 
adequate factual information must be in the hands of the voters 
and of their governmental administrators and representatives 
in time for necessary action. Modern business enterprisers, 
too, faced with rapidly changing conditions, must lean more 
and more heavily on statistics to point the way to the solution 
of their problems. 

During a great national crisis, such as a severe depression or a 
war, the value of statistical data is enormously enhanced. In 
depression periods, published statistical data from governmental 
sources, which in retrospect appear to have been but a trickle in 
prosperity, swell to flood proportions. Modern war, moreover, 
as well as being a “war of supply,” a “war of machines,” or a “war 
of production,” is a “war of statistics.” The fact that much 
of the increasing wartime volume of statistical data is confidential 
explains the deceptive appearance of fewer statistics 
in wartime than in peacetime. During the Second World 
War the statistics published by the United States Bureau of the 
Census, for example, sharply decreased because its organization 
and equipment were almost fully employed doing wartime sta¬ 
tistical work, especially for such agencies as the War Production 
Board and the Office of Price Administration. 

So diligent have been the efforts to obtain current knowledge by 
means of statistics during the past fifty years that a vast source of 
raw material now exists, covering many fields of knowledge. 
Elementary acquaintance with these sources is essential to all 
those who hope to work in either the natural or the social sciences. 
Complete familiarity with sources of statistics can come only with 
long practice in their use. It would be futile to attempt to impart 
to the student this desirable familiarity by giving a complete 
description of all sources. 

The Pattern of Statistical Sources. The student cannot hope to 
memorize the names of all sources of statistics; indeed, the 
attempt would not be useful, for the names change and new ones 
are added as time goes by. Comprehension of the pattern of 
development of statistical sources, however, will enable the stu¬ 
dent to become a scholar who, when confronted by a statistical 
problem, will have acquired a “statistical sense” that will guide 
him to the appropriate sources. This presumption explains why 
the present chapter on sources is given an historical or a genetic 
setting: Let the names of all the statistical agencies be changed 
by the Second World War, and the study of the historical and 
genetic explanation of statistical sources will still help the student 
acquire that scholarly ability required to locate sources; he will 
have historical perspective to facilitate his prompt understanding 
of the postwar world of statistical sources. In any case, the 
period between the First and the Second World War will long 
continue to be one intensively studied by statisticians of coming 
generations. 

In the ensuing description of sources of statistics, which is 
presented in its historical or genetic aspects, governmental 
sources are given more space than nongovernmental sources, 
because the general statistician deals mostly with the former. 
While the specialized statistician must acquire detailed knowledge 
of sources in his special field, he also needs to be familiar with 
governmental sources in his field. Governmental sources, more¬ 
over, are themselves one of the best guides to the successful use of 
nongovernmental sources, because many governmental agencies 
are secondary sources that give complete and very useful descrip¬ 
tions of the primary sources used. 

The motive underlying the gathering and publication of 
statistics by private enterprise has usually been the profit 
available through the sale of such statistical information to 
commercial, banking, and manufacturing or distributing enter¬ 
prises. In many instances these services emerged as incidental 
features of existing publications; an example is the increasing 
amount of statistics of all kinds published in newspapers and 
periodicals. In other instances the statistical feature was the 
original purpose of the publication; many trade journals are 
cases in point. 

The state and privately endowed universities of the nation are 
important sources of statistical research, especially of a pioneering 
character, in all branches of knowledge—some being famous for 
certain fields of statistical work. 

During recent years one of the most striking aspects of “big
business” development has been the maintenance of research
organizations contributing to statistical knowledge, a fact that the
public was not permitted to forget as it visited the 1939 World's
Fair in New York and read newspaper and magazine wartime
advertisements in the early 1940's. Some corporation-financed
research organizations, primarily intended for profit making, have 
incidentally contributed in important ways to the advancement of 
scientific statistics in engineering, business, and the use of agricultural
products. Most of the pioneering statistical research
in agriculture, however, as well as in labor organization, wages, 
and the like, is done by governmental units or by the govern- 
mentally sponsored agricultural experiment stations connected 
with various state colleges or universities. 

The motive underlying governmental activity in the collection 
and publication of statistics has been to increase knowledge of 
facts so that administrators may adjust government action to the 
changing needs of a dynamic society, so that democratic repre¬ 
sentatives of the people may legislate more expeditiously and 
wisely, and so that the voters in a democracy may have the
opportunity to know the facts. In recent years, a great expan¬ 
sion in the governmental activity of collecting and publishing 
statistics to aid business enterprise has occurred. In short, 
governmental statistical agencies assist both public and private 
economic planning. The large quantities of statistical informa¬ 
tion released by the Department of Labor and Department of 
Commerce are eagerly awaited by business enterprisers seeking to 
keep up to date in their methods, labor policies, coverage of 
potential markets, and knowledge of desirable sources of raw 
materials. True, their zeal in filling out the questionnaires that 
constitute the sources of the desired statistics sometimes falters, 
but on the whole businessmen recognize the truly cooperative 
character of the system of collecting and disseminating business 
statistics and stoically endure the barrage. 

As a consequence of the manner of their historical and genetic 
origin, therefore, modern statistical sources in the United States 
fit into a pattern that is more or less uniform among the various 
fields of knowledge. This pattern is roughly as follows: 

Research of private enterprisers: 

Individual enterprisers. Special monographs, articles, and other con¬ 
tributions are made by individuals and published under the sponsor¬ 
ship of universities, professional publications, and the like. 

Research associations. Quantities of statistical data are collected by 
research organizations, some of which are hired by corporate or
noncorporate “private enterprisers” in the business world, some
connected with universities, and some independently endowed.

Commercial sources, i.e., privately financed publications: 

These sources are in the business of collecting and publishing statistics 
as a profit-making enterprise; they include 

Trade journals. 

Commercial and financial periodicals and services. 

Official publications of the government: 

Federal or national governmental agencies. 

Local, i.e., state or municipal, governmental agencies.

Guides to Sources of Statistics. If a trained professional 
librarian is available for consultation, he is the best informant on 
the subject of guides or handbooks to all general fields of research. 
However extensive may be the experience and training of the 
research scholar, he finds himself continually relying upon the 
local librarian, who makes a specialty of keeping posted on new
developments with respect to handbooks and literary guides of all 
kinds. 

Guides to Nongovernmental Statistics. Practically every conceivable
occurrence in the world of man or beast, in the heavens,
on the ground, under the ground, on the sea, under the sea, or in 
astronomical space holds an interest for some individual or group 
of individuals; either as a hobby or as a means of livelihood some 
individual or group of individuals is now and has for many years 
been collecting statistical facts about all these world events. The 
existing sources of statistics necessarily therefore appear at first 
glance to be an unwieldy mass; but, fortunately both for begin¬ 
ners and for practiced scholars, this mass has been for some time 
culled over and classified, indexed and cross-indexed by various 
types of handbooks, yearbooks, or guides of one sort or another. 

The general magazine indexes constitute one class of such 
guides; the principal ones are as follows: 

Agricultural Index.

Education Index. 

Engineering Index. 

Industrial Arts Index. 

Public Affairs Information Service.

Readers' Guide to Periodical Literature.

Such indexes or guides are compiled monthly and cumulated
into annual volumes, and articles of a statistical character appearing
in a comprehensive variety of journals and trade magazines
can be discovered by the intelligent use of these alphabetically
arranged indexes. The above-listed indexes are not specifically
organized as guides to statistical sources; their collective purpose 
is as broad in scope as all modern knowledge, but one of their 
varied uses is to serve as guides to sources of statistics. 

Indexes or handbooks specifically dedicated to serve as guides 
to sources of statistics do, however, exist in considerable number. 
In 1937 the United States Department of Commerce published 
“Sources of Current Trade Statistics” (Market Research Series
13), which lists practically all current trade statistics by govern¬ 
mental and nongovernmental agencies; this handbook was 
designed for the use of manufacturers, distributors, financial 
institutions, advertising agencies, trade associations, bureaus of 
business research, and individuals engaged in research work. In 
1942 the United States Department of Commerce published a 
handbook entitled Trade and Professional Associations of the 
United States, which lists the sources of practically every conceivable
type of trade statistics compiled by nongovernmental agencies.

In 1934 a scholarly attempt was made by Gerlof Verwey and 
D. C. Renooy to construct a manual of statistical sources under 
the title The Economist's Handbook; this book was published in 
Amsterdam, Holland, and a supplement appeared in 1937. It is 
a guide to statistical sources on economic subjects, covering 
Belgium, France, Germany, the Netherlands, Switzerland, the
United Kingdom, and the United States. In the United States, 
D. H. Davenport and F. V. Scott were authors in 1937 of An Index 
to Business Indexes, a book containing information about the
many indexes used in business, including the name of the compiler, 
description of the index, frequency of publication, period covered, 
and the name of the publication in which current data appear. 
In 1937 the Special Libraries Association published a handbook 
Guides to Business Facts and Figures in which Part III is an index 
of statistical sources of information. 

A multiple assortment of handbooks in various special fields 
serve as guides to statistics in each special field of knowledge, 
along with other purposes for which the handbooks are issued. 
For example, Management Handbook, Flitcraft, and Handbook
of Accountants serve, in their respective fields, as guides to 
statistical sources.

Often the purpose of a handbook or index of sources of statistics
is served by one of the numerous abstracts of statistical
data. The Statistical Abstract of the United States, published
annually by the United States Department of Commerce, is 
itself a source of statistics, but it is also an index to sources 
because at the head of or in a footnote to each table of data it 
records the primary source from which the data are obtained. 
Similarly, the World Almanac, which for 58 years has been pub¬ 
lished by the New York World or the New York World-Telegram, is 
itself a source of statistics and also a guide to sources for the same 
reason. 

Guides to Governmental Statistics. Many of the handbooks 
serving as guides to statistical sources compiled by nongovernmental
agencies include in their alphabetical indexes a large
range of governmental statistical sources as well; but there are a 
number of important handbooks specifically intended to serve as 
guides to the maze of governmental sources of statistics. The 
best-known and most comprehensive guide is the United States 
Government Manual, published by the government. In 1938 the 
Central Statistical Board (later the Division of Statistical Stand¬ 
ards of the Bureau of the Budget) published a Directory of Federal 
Statistical Agencies. The Central Statistical Board was organized 
in 1933 in order to find some means of coordinating the various 
types of Federal statistics.¹ The business of the Central Statistical
Board was to serve as an agency for the reorganization in
collection, tabulation, and use of Federal statistics. It was hoped 
such an agency could help solve the problem of overlapping in 
statistical function, which caused unnecessary burdens upon 
respondents to questionnaires and which also resulted in ineffi¬ 
ciency in the utilization of statistical information. 

1 In the task of perfecting Federal statistics the government has received
the advice of scientific professional associations: See American Statistical
Association and the Social Science Research Council, Government Statistics:
A Report of the Committee on Government Statistics and Information Services
(1937).

In response to a request by the President in a letter of May 16,
1938, the Central Statistical Board made a report on the question
as to whether or not it is possible to reduce the amount of duplication
in statistical reports. The board concluded that much
could be done in the way of coordinating the gathering, tabulation,
and presentation of Federal statistics; by such coordination,
comparability in definition would bring about a great improve¬ 
ment in the efficiency of data collected. With reference to the 
reduction in the amount of duplication, however, the board
concluded that a majority of the financial and other statistical
reports and returns made by the public to the Federal govern¬ 
ment are incidental to the administration of governmental 
functions; the statistics are a by-product of either administrative 
or control functions of the government. Consequently, the 
board recommended that the Federal statistical and reporting 
services should remain largely decentralized so that they may be 
associated with the respective governmental functions to which 
most of them specifically relate; but that there is a continuing 
need for a statistical coordinating agency, with a specially 
trained staff and with broad powers.¹ One important result
of the coordinating functions of the Central Statistical Board
was the publication of a directory of federal statistical agencies,
which has already been mentioned.

1 Report of the Central Statistical Board, 76th Congress, 1st Session,
House Document 27, Jan. 10, 1939.

A general guide to government publications, Anne Morris
Boyd's United States Government Publications (1941), serves
incidentally as a guide to governmental sources of statistics.
This book also gives an analytical picture of the character and
scope of government publications. The same may be said
regarding Laurence F. Schmeckebier's Government Publications
and Their Use (1939).

RESEARCH OF PRIVATE ENTERPRISERS

Individual Enterprise. Pioneers. In spite of the fact that
Domesday Book was an eleventh-century product and that even
earlier examples of governmental collection of statistics can be
cited, it remains true both historically and currently that the
pioneer work of converting public records into statistics is
nongovernmental. The pioneers have been and are individuals.
The father of modern vital statistics is John Graunt, who in the
seventeenth century made statistical investigations that served
as the basis for founding life insurance. Another seventeenth-
century scholar, Sir William Petty, was the outstanding pioneer
in developing statistics for the social sciences. Both these
men were associated with the early development of the Royal
Society of London, which was incorporated in 1662 and is the
oldest of modern learned societies.

Pioneering in the art as well as the science of statistics con¬ 
tinues in modern times to be highly individualistic. This is 
exemplified by the work of Karl Pearson in England¹ and in the
United States by such men as Wesley C. Mitchell and his works
on index numbers and the business cycle, Warren Persons and
his work on the statistical analysis of business statistics, and
many others.² Individual contributions are commonly presented
in the publications of learned societies, such as the Journal of the
Royal Statistical Society, the Journal of the Statistical Society of
London (founded in 1834), and the Journal of the American Statistical
Association (founded in 1839). These and the publications
of other learned societies are indexed in the guides mentioned
earlier in this chapter.

1 See Chaps. XIII and XIV.

2 See Chaps. XIX and XX.

Research Associations. During the 1920's and 1930's a number
of important research organizations in the field of economics
and social institutions were organized. The Brookings Institution
in Washington, D.C., the Harvard Committee on Economic
Research, the National Industrial Conference Board, the National
Bureau of Economic Research, and the Cowles Commission
were among the most prominent.

The Harvard Committee on Economic Research was organized
in 1919 to study business trends and cycles and to publish a
scientific business forecaster; its work was launched under the
leadership of Warren Persons. In addition to the forecasting
service, this research organization publishes the Review of Economic
Statistics (quarterly) and once or twice a year a summary
of statistics called the Statistical Record.

The National Industrial Conference Board was organized by a
group of comparatively public-spirited manufacturers to study
the various problems of employer-employee relationships, leading
them into special studies of real wages, income distribution, and
general economic conditions. It publishes its studies in the
form of books appearing as they are written. In addition to
the subjects mentioned above there have been National Industrial
Conference Board books on cost of living, statistics of income
by states, and availability of bank credit. The National Industrial
Conference Board has also published since 1940 The Economic
Almanac, which is a widely used annual.

The National Bureau of Economic Research was founded in 
1920, sponsored by a group who believed that a purely dis¬ 
interested approach is desirable and that no group should control 
the findings of this new statistical organization. It is so con¬ 
stituted as to produce this desirable result. A number of special 
studies of economic and social conditions have been made and 
published under its auspices and some in cooperation with the 
government. For several years it has occasionally issued 
bulletins containing data resulting from studies that usually 
appear later in more detail in book form. 

The nature and accomplishments of the National Bureau of 
Economic Research are indicated by the following quotation from 
the twentieth annual report of the director of research:¹

The National Bureau was established by men who believed that it is 
becoming possible to apply quantitative methods to the study of economic
behavior. They realized that this field is far more difficult than
the fields in which science has won its major triumphs and demonstrated 
its practical usefulness most conclusively. Also they recognized that 
investigators cannot experiment at will upon society; though society 
can and does experiment loosely upon itself. . . . Economics was not 
likely to grow faster at this turning point in its career than its elder 
sisters [the natural sciences]. But at the close of the First World War 
the materials for observing actual behavior were multiplying so rapidly 
and analytic methods of extracting significant conclusions were becoming 
so versatile and powerful that our founders thought their staff had good 
prospects of rendering valuable service at once. Also they hoped that 
one modest success would lead to others, fostering cumulative growth 
of the kind that has characterized systematic research in other fields. . . .

Twenty years of effort along the lines laid down in 1920 have con¬ 
firmed our faith in the social value of what the National Bureau set out 
to do. Our accomplishments have not been spectacular, but they have 
been substantial, and they afford a secure foundation on which to build 
in future. We have more reason than ever to believe that in trying to
establish a few economic fundamentals firmly we are aiding thoughtful 
men of all persuasions to plan wisely. If tested knowledge is the safest 
and surest guide in practical affairs, our work has social meaning, however
technical its character. . . . We hold that advance will be rapid
and continuous in proportion as the workings of our economic system
are understood. In trying to replace speculative opinions about economic
relations by conclusions resting upon evidence we are expediting
progress in the most effective manner we know.

1 Mitchell, Wesley C., “The National Bureau's Social Function,”
March, 1940, pp. 13-15, 19.

. . . Another device, peculiar to the National Bureau, is to select 
directors who have divergent views on public policy and give each an 
opportunity to criticize every manuscript. That device has been of 
inestimable help to us in keeping our reports nonpartisan and therefore 
worthy of credence by the public. Having such a board we cannot 
expect unanimous consent from its members to many policies that 
individuals among us favor. But the mere fact that the National 
Bureau never takes sides upon controversial issues adds its bit of pro¬ 
tection against bias in our publications and helps toward meriting and 
winning public confidence. 


The more thoughtful sections of the public we are now reaching in 
various ways. Physical scientists are coming to recognize the con¬ 
tributions of research in economics; for example, in I Believe Robert A. 
Millikan says 

“[In] economics and the social sciences long and elaborate statistical
studies must be made in order to eliminate the disturbing factors and
thus obtain the controlled conditions. We are just beginning to have
available, through the National Bureau of Economic Research and other
similar agencies, a large amount of such definite, dependable, statistical
knowledge in economics.”

The Twentieth Century Fund is another research association 
organized to function in a manner similar to that of the National 
Bureau of Economic Research. It publishes occasional pam¬ 
phlets or books. 

THE COMMERCIAL SOURCES 

In addition to the numerous sources of statistics resulting from 
individual or group research such as those described above, a 
great quantity of statistical sources has come into existence as 
the result of the activities of those who go into the business of
collecting and selling statistical data for profit. Such are the
trade journals and the commercial and financial periodicals. 

Trade Journals. A large number of trade journals are actively 
engaged in collecting statistical data for various types of enterprise.
The Iron Age, for example, founded in 1856, is the trade
journal for the iron and steel industry, publishing statistics on
iron and steel production in all states and the prices of iron, steel, 
copper, zinc, etc. Another example is Wileman’s Brazilian 
Review, which is the trade journal for coffee. The trade journals 
are frequently used by governmental statistical organizations, 
such as the Bureau of Labor Statistics, the Department of 
Commerce, and the Board of Governors of the Federal Reserve 
System, as the primary sources of particular data assembled 
by them. Occasionally the trade journals will publish in special 
pamphlet form or in books assembled data of the trade. 

During the 1920’s, a large expansion in the collection and 
publication of statistics in various lines of economic activity on 
physical commodity production and distribution took place. 
In a few instances this work was done by private companies. 
Thus Seidman and Seidman compiled data on furniture for the 
Grand Rapids district, and R. L. Polk and Company compiled 
data on new cars registered; the function of the latter was subsequently
taken over by Ward's Automotive Reports. Many
such series were compiled by the trade journals from public 
records. The Iron Age compiled data on physical quantities 
of production of pig iron, and the Statistical Sugar Trade Journal 
published quantitative sugar statistics. 

Trade Associations. Most of the production and distribution 
series are compiled by the various trade associations, such 
as the American Face Brick Association (merged with the 
Structural Clay Products Institute), the American Paper and 
Pulp Association, and the United States Cane Sugar Refiners’ 
Association. 

The production and distribution statistical series are of various 
types. Some measure the flow of commodities through the 
process of production and distribution, for example, data on 
raw material received or consumed, like the figures on cotton 
consumption by textile mills or on cattle receipts at stockyards. 
Others give a measurement of quantity or stock of a commodity 
on hand. Still others are figures on the amount of orders or 
sales of the product, such as the unfilled orders of the United 
States Steel Corporation. As noted elsewhere, many of these 
series are collected from their original sources and published 
by the United States Department of Commerce in the Survey 
of Current Business. Consequently, the appendix of the Survey
contains a description of about every important commercial
source of statistics. In fact, the Department of Commerce 
publishes a description of such statistical sources.¹ Frequently
a trade association will publish a sort of handbook or abstract 
of statistics for the trade, covering historical as well as current 
statistics.²

Commercial and Financial Publications. The commercial 
and financial journals and services are also too numerous to 
mention in detail, but a few may be described as typical. Among 
these are the Commercial and Financial Chronicle (weekly), the 
New York Journal of Commerce (daily), the Wall Street Journal³
(daily), Bradstreet's (merged in 1933 with Dun's), Babson's
Reports, Moody's Investors' Service, Standard & Poor's Corpora¬ 
tion, Brookmire Economic Service, and the Dodge Statistical 
Service. 

While there is much overlapping of published commercial 
and financial statistics through these various publications and 
services, nevertheless each has become noted for especially good 
statistical service in a particular line. For example, the user of 
business-failure statistics thinks first of Bradstreet's, because for
many years the data that it has published on business failures
have been widely used. Bradstreet's was also famous for its
index of wholesale prices for the United States, being a pioneer
in the development and publication of such an index. Babson's
and Brookmire's services are noted for business forecasting and
for investment services and forecasting the stock market. The 
New York Journal of Commerce is noted for its current data on 
new securities issued and on the produce markets. The New 
York Times is noted for its index of business activity, which 
was published in the Annalist (weekly) until that periodical 
was discontinued. The Commercial and Financial Chronicle is 
particularly useful for its detailed array of current data on bank 
clearings, business failures, interest rates, stock and bond prices, 
corporations, capital stock and bond issues, and the money 
markets of the world. This remarkable publication can be
traced in its lineage back to 1820, when it started as Niles'
Weekly Register, famous as an early preacher of the doctrines of
high tariffs and the “American system.” From 1839 to 1865 it
was called Hunt's Merchants' Magazine. Since 1866 it has
gone under its present name. The financial statements of all
kinds of corporations, together with other statistics and corporate
histories, are to be found in Moody's Manual of Corporations.

1 Sources of Current Trade Statistics, Market Research Series 13 (1937).

2 United States Cane Sugar Refiners' Association, Sugar Economics, Statistics,
and Documents (1938).

3 Often referred to in footnotes as Dow, Jones and Company, which sometimes
mystifies beginners.

The Commodity Yearbook is published by the Commodity 
Research Bureau, New York, N.Y. This is a private organiza¬ 
tion devoted to the dissemination of accurate information on 
commodities and other related subjects, including production, 
consumption, prices, stocks, imports, exports, etc. Some are 
annual, some are monthly data. 

All the above-described sources are extensively used by 
American and foreign business enterprisers, whose subscriptions 
to them and advertising in them make possible the vast statistical 
undertakings on a profitable basis. The fact that they are so 
supported would seem to prove the value of statistics to modern 
business enterprise. 

OFFICIAL PUBLICATIONS OF THE GOVERNMENT 

Federal Statistical Agencies. Department of Commerce. The 
Department of Commerce is one of the greatest fact-gathering 
organizations of the Federal government, if not the greatest. 
It contains a number of bureaus chiefly engaged in the dissemina¬ 
tion of facts concerning not only commerce but economic and 
social life in general. The Bureau of the Census is the fact¬ 
gathering agency of the Department. 

The Articles of Confederation provided for the taking of a 
triennial census, but the Constitution of the United States 
provides for the taking of a population census every 10 years, 
to serve as the basis for Congressional apportionment. The 
first one was taken in 1790. The broad practical and scientific 
purposes that the census today serves were not in the minds of 
the American founders, and the earlier census publications were 
meager affairs compared with the modern census.¹ The census
of 1790, for example, returned the number of free white males
over, and the number under, sixteen years of age, the number of
free white females without distinction by age, all other free
persons, and slaves, without, in the case of the last two classes,
distinction by either sex or age. The published census of 1790
consisted of a volume of 52 pages. At the census of 1800 and of
1810, five age classes were distinguished and the age classification
was extended to white females. In addition, at the census of
1810, some facts were compiled relating to manufacturing establishments,
their number, nature, extent, situation, and value.
A digest of the results of these data was prepared by Tench
Coxe and published in 233 pages. The census of 1820 introduced
the idea of collecting occupational statistics, calling for enumerations
of persons engaged in agriculture, commerce, and manufactures.
The census of 1830 returned to the original idea of
obtaining merely a population enumeration; but in 1838 President
Van Buren suggested to Congress in his annual message
that the census should be extended so as to include “authentic
statistical returns of the great interests specially entrusted to or
necessarily affected by the legislation of Congress.”² As a
result, Congress provided in the act for the Sixth Census (1840)
that the marshals should “return in statistical tables ... all
such information in relation to mines, agriculture, commerce,
manufactures, and schools, as will exhibit a full view of the
pursuits, industry, education, and resources of the country.”³
Congress overreached the capacity of those entrusted with the
task of census taking, for the census of 1840 is famous for its
inaccuracies. At the census of 1850, improvements in the
organization of collecting and compiling the statistics were
made; and, according to Cummings, with the census of 1850
the decennial enumeration began to assume modern proportions
and character.

1 Cummings, John, “Statistical Work of the Federal Government of the
United States,” in Koren, History of Statistics, pp. 670-672.

2 Ibid., p. 672.

3 Ibid., pp. 672-675.

One of the outstanding American economists of the nineteenth
century, Francis A. Walker, was a pioneer in developing the
census to what we understand it to be now. He did particularly
notable work in perfecting the organization and presentation
of statistical data in the Tenth Census (1880), of which he had
charge.

At the Eleventh Census (1890), machine tabulation was
introduced (the Hollerith tabulating machines), at a great
saving of time and expense. The printed reports of the census
of 1890 aggregated 21,410 pages, in 25 quarto volumes, the final 
report being issued in 1897. The Bureau of the Census was 
established as a permanent one in 1902 and since that time has 
been in continuous operation as a great fact-gathering organiza¬ 
tion for the national government. The tendency since that 
time has been to confine the decennial census to the major 
subjects of population, manufacturing, agriculture, mines, and 
quarries, and in intervening years to take censuses of business. 
In intercensus years the Bureau also has charge of the annual 
collection of mortality data, statistics on religious bodies, the
collection and compilation of statistics of cotton and tobacco, 
and the annual compilation of statistics of cities of 30,000 popula¬ 
tion and over, and financial statistics of states.^ 

After 1902 the census of manufactures was taken every
5 years until 1919 and thereafter every 2 years until 1939. The
census of agriculture has been taken every 5 years since 1910. 
The Statistical Atlas (containing graphic illustrations of much of 
the census data) was first issued in 1874 [based on the Ninth 
Census (1870)] and has appeared irregularly since that date. 

In 1929 a census of distribution as well as of manufactures 
was taken; but when the National Recovery Administration 
began operations, many of the data assembled in the census 
year 1930 were out of date owing to the sharp business recession 
and the increase of unemployment following that year. Along 
with the regular biennial census of manufactures for 1933 the 
Bureau of the Census undertook an extensive census of business 
of types other than manufacturing, such as amusements, service 
businesses, barbershops, beauty parlors, repair shops, and tourist 
camps, covering more than 2,400,000 individual establishments. 

^ By order of the Secretary of Commerce, the collecting of financial sta¬ 
tistics of states was discontinued temporarily after the 1931 report. With 
no comparative basis provided by the statistics for smaller cities and no 
individual reports for states, the remaining reports were of greatly reduced 
value. A detailed analysis was therefore made of the needs for data in this 
field and of the Bureau's past and present inquiries. Closely related reports
were prepared for the director by the Central Statistical Board, the Advisory
Committee to the Director of the Census, and the Municipal Finance
Officers' Association of the United States and Canada. Accordingly, the
Division of Financial Statistics of States and Cities was reorganized in 1936.
Annual Report of the United States Bureau of the Census, 1937, pp. 23-24;
1938, pp. 28-29.

For subsequent biennial dates the census of business was
further developed. The census of business covering the calendar 
year 1935, for example, was much broader in scope than either 
the census of distribution of 1929 or the census of American 
business for 1933. The 1935 census of business attempted to 
obtain a reasonably complete picture of essential and com¬ 
parable items of business information concerning practically all 
lines of business activity in the United States. It comprised 
a complete census of retail and wholesale trade, service businesses, 
amusement enterprises, hotels, broadcasting stations, advertising 
agencies, banking, insurance, real estate, bus transportation, 
trucking, warehousing, construction, and distribution of manufacturers’
sales through primary channels.

Elaborate care was exercised in preparing the 17 schedules; 
before final use they were submitted for criticism to representa¬ 
tives of the business groups and governmental agencies prin¬ 
cipally concerned. Special efforts were made by the Bureau to 
integrate the census of business and the biennial census of 
manufactures by the adoption of common definitions, instruc¬ 
tions, area designations, and field procedures. In order to 
perfect procedure, conferences were held to discuss schedules, 
procedures, and other problems inherent in such an expanded 
business census. These conferences were attended by representa¬ 
tives of trade associations, professional groups, chain-store 
organizations, etc., and by official representatives of a number 
of governmental agencies—the Central Statistical Board, 
Interstate Commerce Commission, Bureau of Foreign and 
Domestic Commerce, Tariff Commission, Federal Reserve 
Board, and Bureau of Labor Statistics.^ 

The population schedule for the census of 1940 is notable 
for a number of new questions concerning employment status, 
migration, income status, housing, and education. It is also 
notable for the innovation of the sampling technique applied 
to one group of questions in order to widen the scope of the 
inquiries. It dropped the question on literacy. 

^ Annual Report of the United States Bureau of the Census, 1936, pp.
19-21.

Employment and unemployment queries have been made in
previous censuses, but the 1940 census made a new approach.
The new data permit classification of the nation's labor force
into the employed, the unemployed who have had previous work
experience, and the unemployed without previous work experience
(new workers). They provide some measure of the volume
of employment both during the whole year and during the week 
prior to the census day, Apr. 1, 1940. 

The schedule included questions that distinguish people at 
work, people unemployed who are seeking work, and people who 
have a job but are not at work because of temporary illness, 
industrial disputes, or vacations. Persons at work were asked to 
indicate the number of hours they worked during the week 
preceding the census, and the unemployed were asked to state 
the number of weeks they had been seeking work. Workers 
were classified as to whether they were in private industries or 
were employed by the government and whether they were own-account
workers or unpaid family workers.

The new inquiry on wages and salaries is important as a 
measure of national purchasing power and its distribution, and 
the resulting data have been helpful to business in indicating 
potential market areas. 

The net effects of internal population migration during the 
preceding 5 years were obtained by requesting the place of 
residence for each person as of Apr. 1, 1935. It is expected that 
compilation of the statistics comparing such residence with that 
of Apr. 1, 1940, which is also recorded on the schedule, will 
measure the effects of industry shifts, droughts, depressions, 
floods, the backflow west to east, and the shift from the city to 
the country, or vice versa. 

In 1940, for the first time, the decennial census included a 
separate housing schedule designed to give detailed information 
for each dwelling unit in the United States, whether occupied or 
vacant, rural or urban. Data were obtained as to the number of 
rooms, water supply, bath and toilet facilities, and light equip¬ 
ment. For each occupied unit or household, information was 
obtained concerning the principal means of refrigeration used, 
the presence or absence of a radio, the character of the heating 
equipment, and the principal heating and cooking fuels used. 
Each residential structure was described in respect to single,
double, or multiple family occupancy, whether or not it contained
a business unit, for what purpose and in what year it was originally
built, the principal exterior material of the structure, and
whether it was in need of major repairs. The schedule included
a question on whether the family leases or owns, whether there
is mortgage indebtedness, and methods of home finance.

It is expected that the compilation of these data will provide 
valuable information on the latent purchasing power of a com¬ 
munity. There is no more important index of the social and 
economic status of a population than the standard of its housing. 
Housing experts believe that the information gathered will be of 
inestimable value in determining future housing policies. It 
will be of especial interest to manufacturers, builders, distributors, 
and bankers in their study of trends in home ownership and 
building in the United States. Cities will be able to determine 
the distribution of the various types of housing within their 
limits, together with the possible need of expansion of transporta¬ 
tion and communication systems, police and fire protection, 
schools, and similar facilities. Data showing the equipment in
houses, together with the state of repair of the homes, will be of 
value to manufacturers and distributors of housing products 
in the planning of their sales campaigns.^ 

The agricultural schedules for the census of 1940 likewise had 
a number of new features. Nine regional schedules, each used 
in a separate group of states, were especially designed to fit 
national variations in cropping practices. Questions designed 
to obtain subtotals for the value of various major categories of 
farm products sold or traded in 1939 made possible a much
closer estimate of total farm income and of farm income by 
principal sources. The 1940 census also introduced a supple¬ 
mentary plantation schedule for use in the cotton belt that made 
possible a refined distinction between farms and plots cultivated 
by croppers and defined the exact status of each cropper and
certain other tenants in relation to the plantation owner. Ques¬ 
tions to measure the effects of current agricultural policies were 
also asked, relating to soil-improvement crops, summer fallow, 
crop failure, and succession or interplanted double cropping. 

^ Cf. The New York Times, Jan. 24, 1940.

The Bureau of Foreign and Domestic Commerce is the great
Federal fact analyzer and fact publisher in the Department of
Commerce. It has a curious and rather complicated history.
From the beginning of the national period, the statistics of
foreign commerce were linked up with our tariff policy and maintained
by the Treasury Department. In 1856, growing out of
an investigation of the tariff policies of other countries by the
State Department, there was created a Bureau of Foreign Commerce
as a permanent bureau for the purpose of collecting
statistics on foreign trade. In 1866, the Bureau of Statistics 
of the Treasury Department was created to take special charge 
of this work, and at the same time Congress gave it power to 
collect statistics on domestic trade as well as on foreign trade. 
In 1905, a Bureau of Manufactures in the Department of Com¬ 
merce was organized to foster, promote, and develop the various 
manufacturing industries of the United States, and markets for 
the same at home and abroad, by gathering and publishing all
available and useful information concerning industries and 
markets. 

As a consequence, there were bureaus in three separate depart¬ 
ments (Treasury, State, and Commerce) concerned with the 
gathering of foreign-trade statistics. In 1912, however, these 
functions were centralized in the Bureau of Foreign and Domestic 
Commerce of the Department of Commerce. 

The most important statistical publications of this bureau 
are the monthly Survey of Current Business (with a weekly sup¬ 
plement) and the annual Statistical Abstract of the United States. 
Special publications designed to aid business are also prepared,
for example, historical studies of industries, studies of the national 
income produced, and studies of market data.^ 

Other bureaus of the Department of Commerce are the 
Bureaus of Fisheries, of Patents, and of Navigation and Steam¬ 
boat Inspection, each of which publishes specialized statistics. 
The two great statistical organizations in the Department of 
Commerce, however, are the Bureau of the Census and the 
Bureau of Foreign and Domestic Commerce. 

^ Illustrations are P. W. Barker, Rubber Industry of the United States,
1839-1939 (1939); Division of Economic Research, National Income in the
United States, 1929-35 (1936); B. P. Haynes and G. R. Smith, Consumer
Market Data Handbook (1939). For other statistical publications of the
Bureau of Foreign and Domestic Commerce, see the United States Government
Manual.

Department of Labor. The United States Department of Labor
also contains bureaus that publish statistics, the most important
from the point of view of quantity of data compiled and published
being the Bureau of Labor Statistics. This was created
in 1884 as the Bureau of Labor, although the Treasury Bureau
of Statistics created in 1866 had been enjoined to collect wage
statistics. In 1888 the Bureau of Labor was made an inde¬ 
pendent Department of Labor. The duties of the Department 
of Labor were to acquire and diffuse among the people of the 
United States useful information on subjects connected with 
labor, in the most general and comprehensive sense, and espe¬ 
cially on its relation to capital, the hours of labor, the earnings 
of laboring men and women, and the means of promoting their 
material, social, intellectual, and moral prosperity. The commissioner
of labor in charge of the Department was specially
charged to investigate the causes of and facts relating to all 
controversies and disputes between employers and employees, 
and he was also empowered to make special studies of articles 
controlled by trusts and their effect on production and prices 
and other special subjects. Owing to the excellent work of the 
Department under the wise guidance of Carroll D. Wright, the 
first commissioner of labor, there is available a large mass of 
statistics in the field of labor for this country, including studies 
of strikes, the effect of the introduction of machinery on employ¬ 
ment and wages, the conditions of living and work of the laboring 
population, etc. Upon the basis of the wage and price data 
collected, index figures showing the trends of wages and prices, 
wholesale and retail, have been constructed and published by 
this bureau. 

In 1903, the old Department of Labor was transferred to the 
newly created Department of Commerce and Labor; but in 1913 
there was created a new Department of Labor, and in that 
department the Bureau of Labor Statistics. At the present 
time, the principal publications of the Bureau of Labor Statistics 
are the Monthly Labor Review (published since 1915), bulletins 
on special topics such as wholesale prices, retail prices, cost of 
living, wages, and labor turnover, and monthly serials to supple¬ 
ment the bulletins and give current information on those topics. 
Beginning in August, 1939, the Bureau of Labor Statistics published
a daily index of 28 basic commodity prices at wholesale;
but following the inauguration of wartime price controls this
index was published only once a week, since control in the
raw-material field was widely effective. During wartime the index
was of little importance.^

Treasury Department. For the period before the Civil War 
the chief source of financial and price statistics in the United 
States, as well as data on governmental finance, consists in the
finance reports of the Secretary of the Treasury. 

Before the development of statistical bureaus in the Depart¬ 
ment of Commerce and the Department of Labor, the Treasury 
Department was the most important source of Federal statistics; 
and it is still important in the fields of banking and monetary 
statistics, owing to the work of the comptroller of the currency, 
and in the field of income and Federal taxation and indebtedness, 
owing to the work of the commissioner of internal revenue and 
the Secretary of the Treasury. 

From the United States Treasury Department comes the 
monthly Statement of the Public Debt of the United States. 
The commissioner of internal revenue of the Treasury publishes 
an annual report of income-tax returns, constituting the most 
important source of data regarding income statistics in the 
United States. The annual reports of the comptroller of the 
currency give financial and banking statistics and monetary 
data going back as far as the Civil War, when the national bank¬ 
ing system began. The comptroller publishes these data in 
an annual report and also several times a year in the Abstract of 
Condition of the National Banks.^ The annual reports of the 
director of the mint contain statistics on the production of the 
precious metals, including gold and silver. The Life Saving 
Service of the United States Treasury Department publishes 
data on marine accidents. 

^ For other statistics published by the Department of Labor see the United
States Government Manual and see also Bureau of Labor Statistics, Selected
List of Publications of the Bureau of Labor Statistics (1939), which can be
purchased from the Government Printing Office.

^ See page 82 on the Federal Reserve System.

Interior Department. The Department of the Interior has
important statistical aspects, too. The Bureau of Mines publishes
data on fatalities in coal mines. The Geological Survey
publishes data on metal statistics and minerals. In the census
years it has authority to collect statistics from primary sources.
Since 1880 it has collected statistics carefully as to the crude
oil lifted from the ground, iron ore, etc., watching the physical
consumption of our natural wealth. It also collects and publishes
statistics on electrical power production, which are now
considered useful in the study of the general trend of business,
so important to business is the use of electricity. Other bureaus
in the Department of the Interior are the Bureau of Education, 
Bureau of Pensions, and the Bureau of Indian Affairs, each 
publishing certain specialized statistics indicated by their titles. 

Department of Agriculture. The Department of Agriculture 
was not founded until 1862, but statistical work relating to agri¬ 
culture of a more or less systematic nature dates back to 1839, 
when Congress appropriated $1,000 out of the patent fund, to be 
expended under direction of the commissioner of patents, “in
the collection of agricultural statistics, and for other agricultural
purposes.” At the present time the great bulk of Federal
statistics on agricultural matters is collected and published by 
the Bureau of Agricultural Economics, which originally was the 
Bureau of Statistics in the Department of Agriculture and later 
was known as the Bureau of Markets and Crop Estimates. In 
addition to a host of bulletins on special subjects related to 
agriculture, this bureau publishes a monthly report on weather 
conditions, Crops and Markets, and gives out estimates of annual
crop yields. In recent years it has become the source of pioneer 
statistical work in the measurement of the factors influencing the 
demand for agricultural products and other similar statistical 
studies in connection with the conduct of the Agricultural Adjust¬ 
ment Administration. The agricultural yearbook, published by 
this Department, is a valuable record of agricultural progress in 
the United States and contains also extensive summaries of 
agricultural statistics. Since 1936 these summaries have been 
published separately under the title Agricultural Statistics. 
Current agricultural data are disseminated by the Department 
of Agriculture in its monthly publication, the Agricultural 
Situation. The Bureau of Agricultural Economics, which has 
direct charge of the above publications, also furnishes part of
the program for the Farm and Home Hour on the radio, designed 
to distribute timely agricultural information to the farming 
population of the nation. 

The administrative departments of the government thus constitute
sources of statistics on a large scale, and statisticians
continually make use of these Federal sources of statistics.
These publications of the government are available to everyone at 
very low cost and can be found for free use in most large libraries 
of the country or at offices maintained for the purpose by the 
government. 

The Independent Establishments. In addition to the adminis¬ 
trative departments of the national government there are many 
national commissions or boards or agencies, collectively described 
as the “independent establishments” of the government. Some
of these have become well-known sources of statistical data in 
special fields. The principal ones are the Interstate Commerce 
Commission, the Federal Trade Commission, the Federal Security 
Agency, the Federal Power Commission, the Federal Deposit 
Insurance Corporation, the Securities and Exchange Commission, 
the Tariff Commission, the Maritime Commission, and the Board 
of Governors of the Federal Reserve System. 

The Interstate Commerce Commission was created in 1887 
as the Federal government's solution of the railroad problem, 
following detailed Congressional reports of the situation, known 
as the Windom Report (1873-1874) and the Cullom Report 
(1886). These reports may be said to be the beginning of Federal 
railroad transportation and communication statistics. Since 
1887, such statistics have been gathered and published by the 
Interstate Commerce Commission, its powers having been gradu¬ 
ally extended to include other types of transportation, oil pipe 
lines, and express companies. In 1934 Congress created the 
Federal Communications Commission, which is devoted primarily 
to telephone, telegraph, cable, and radio. 

The Federal Trade Commission is the Federal source of data on 
the monopoly problem. In 1890 the Sherman Antitrust Act 
was passed; and in 1903 Congress realized that there was need to 
collect facts to be used as a basis for the enforcement of the 
Sherman Act. At the urgent request of President Roosevelt, 
Congress created the Bureau of Corporations for the purpose of
gathering data that would aid in the proper enforcement. Fol¬ 
lowing the passage of the Federal Trade Commission Act of 1914, 
the Bureau of Corporations was merged with the Commission. 
This Commission publishes reports on its investigations of various 
trusts, such as the investigation of coal, cotton, cereals, meat 
packing, and a number of others. During the 1920’s and 1930’s
it was a collector and publisher of statistics concerning trade
associations and trade practices. 

The Board of Governors of the Federal Reserve System, which 
has operated since 1913, has become the greatest national source 
of statistics on banking and financial subjects. It publishes an 
annual report containing statistics on banking and related sub¬ 
jects, the Member Bank Call Report several times a year, and the 
Federal Reserve Bulletin, a monthly publication invaluable to 
bankers and statisticians working in banking subjects. In addi¬ 
tion, it publishes weekly mimeographed press releases on the 
condition of Federal reserve banks and of reporting member banks 
in order to make available more current data than is possible 
with the monthly or annual publications. In addition to financial 
and banking statistics the Board also has constructed through 
its Division of Research and Statistics an index of production 
calculated upon a comprehensive basis; this index and other 
special studies are also published in the annual reports and in the 
Federal Reserve Bulletin. 

The United States Tariff Commission, created in 1916, gathers 
statistics purporting to aid in the administration of the tariff 
laws and to help determine when duties should be raised or 
lowered. Owing to the strong influence of politics upon the 
question of the tariff, the studies of the Tariff Commission, with
certain notable exceptions, constitute a great source of misuse 
of statistics. This was particularly true for the period from 
1920 to 1932 when most of its studies were for the purpose of prov¬ 
ing the need to raise tariffs. After the passage of the Reciprocal 
Trade Agreements Act in 1934 extensive improvements were 
inaugurated, and additional data were made available with the 
numerous studies that were conducted in cooperation with the 
State and other governmental departments. 

Finally, in connection with Federal statistics, it should be 
mentioned that frequently Congressional investigations result in 
the assembly and publication of valuable statistical material often 
constituting original sources or at least original compilations of 
such material. Mention has already been made of the Windom 
Report in 1873-1874 and the Cullom Report in 1886, both on 
transportation, which led to the creation of the Interstate Commerce
Commission in 1887. Other examples are the Pujo
Money Trust Report of 1913 and the various reports of the Senate
and House Committees on Banking and Currency during the
1930’s on brokers’ loans, branch banks, the operation of the
national and Federal reserve banking systems, foreign loans, and 
stock-exchange practices. Important Federal legislation of that 
decade was based on these investigations. 

Several noteworthy special commissions, created by Congress 
from time to time, have produced published documents that have 
become famous as great sources of primary statistical information. 
The Aldrich Reports from the Senate Committee on Finance, 
on Retail Prices and Wages (1892) and Wholesale Prices, Wages, 
and Transportation (1893) constitute extensive compilations of 
price data covering a period of over fifty years. These reports 
have been extensively used as source material for statistical 
studies of prices and wages for the period 1850 to 1900. 

The Industrial Commission created by act of Congress of 
June 18, 1898, submitted a report to Congress in 1902, consisting 
of 19 volumes and presenting a substantially complete epitome of 
the industrial life of the nation and of the important changes in 
business methods that occurred in the latter part of the nineteenth 
century. These volumes are largely statistical in their methods 
of description. The Immigration Commission, created in 1907, 
presented to Congress in 42 volumes a full inquiry into the sub¬ 
ject of immigration, reviewing statistically immigration to the 
United States during the period 1820 to 1910 and the com¬ 
ponent elements in our population as determined by immigration 
from 1850 to 1900. The National Monetary Commission, 
created in 1908, studied the banking and currency systems of 
the United States as compared with those of other countries. 
This Commission collected more complete statistical information 
with regard to the banks of foreign countries such as Great
Britain, France, and Germany than had ever been collected 
before and for the first time in this country obtained compa¬ 
rable statistics for all banks in the United States. The full 
report of the Commission, consisting of 24 volumes, was completed
in 1912 and served as the basis of the bank-reform
legislation known as the Federal Reserve Act.

Other similar statistical studies in various fields of economic 
and social life have been made by commissions, such as those of 
the Select Committee on Wages and Prices established in 1910, 
the Commission on Industrial Relations created by an act of
1912, and the Commission on National Grants to Vocational
Education. The Hoover Committees on Social Trends (1933)
published extensive studies, partly statistical in character, of the 
economic and social life of the nation. 

One of the most notable of such temporary organizations was 
the National Resources Planning Board, established in the 
executive office of the President of the United States under 
authority of the Reorganization Act of 1939. This Board 
succeeded the National Resources Committee, which had been 
established in 1935. Earlier names of the same organization 
were National Resources Board and Advisory Committee and 
National Resources Board, which was created in 1934 to succeed 
the planning organization of the Federal Emergency Administra¬ 
tion of Public Works. When the United States Congress dis¬ 
covered what it felt was an attempt by the executive to usurp 
Congressional powers by having an economic planning board, 
it became hostile to the National Resources Planning Board. 
This hostility was not diminished when in 1943 the Board pre¬ 
sented to the executive a plan for the postwar expansion of the 
Federal security program. President Roosevelt handed the 
report over to Congress for action, but the Board was abolished 
in that year when Congress refused to vote funds for its con¬ 
tinued existence. During the course of its checkered career, 
however, the Board became the author of several noteworthy 
statistical publications: Energy Resources and National Policy
(1939), The Problems of a Changing Population (1938), Consumer 
Incomes in the United States (1938), Consumer Expenditures 
in the United States (1939), and The Structure of the American 
Economy (1939). 

State and Municipal Sources. The activities of the various 
state governments result also in the compilation and publication 
of statistics. Most states maintain departments of institutions 
and agencies that, through supervision of reform schools, prisons, 
hospitals, and the like, become sources of statistics on mental 
and physical pathology, as well as delinquency. Data concerning
the records of penal and charitable institutions, hospitals,
and asylums for the insane and feeble-minded are primarily
recorded by state or by municipal organizations. 

Vital statistics, that is, data relating to births and deaths and
the classification of deaths by causes, have become an important
part of the demographic work of municipalities and states and
have thus made the state and municipal governments important 
primary sources of data of this character. In addition, statistics 
on marriage and divorce are recorded through state and munici¬ 
pal licensing administration. 

Data are recorded by states and regularly reported, based on 
their tax-collecting, licensing, and registration responsibilities. 
For example, statistical data result from automobile registration 
by states. 

State incorporation laws result in the accumulation of data. 
State incorporated banks and trust companies and building and 
loan associations, for example, are all regulated by the banking 
departments of the various states, and statistics regarding these 
institutions are regularly compiled and published by these 
departments. Similarly, life insurance, fire insurance, automo¬ 
bile and casualty insurance, and workmen’s compensation laws 
and social-security laws have resulted in state-regulating bodies 
and the compilation and publication of statistical data on 
financial, commercial, and industrial subjects. 

A number of the larger and older of the industrial states have 
highly efficient labor departments, which compile and publish 
statistics of industrial conditions. Of increasing importance and 
interest to social scientists is the development of the volume of 
statistics relating to industrial accidents and diseases, growing 
out of the need for such statistics in the administration of the 
workmen’s compensation laws. 

The regulation of public utilities and water companies and 
street-railway and bus companies by state and municipal authorities
has made the public-utility commissions of the states the
principal primary sources of statistical data on these important
industries, although in the 1930’s many of these data were
gathered by the Federal Power Commission and the Securities
and Exchange Commission.

WORLD STATISTICS 

Under the League of Nations progress has been made in the 
collection and publication of world statistics. These are pub¬ 
lished in the Monthly Bulletin of Statistics of the League of 
Nations and also in its International Statistical Yearbook and 
its annual World Economic Survey. Statistics on world commercial 
banking and finance were published in special League 
publications. Previous to the work of the League of Nations 
in this respect, the World Almanac had for many years been 
highly valued as a rough-and-ready source of a variety of world 
statistics and still constitutes a popular source. 

The Statesman's Yearbook, published by Macmillan & Com¬ 
pany, Ltd., London, is a statistical and historical annual of the 
states of the world, giving data on population, area, finance, 
commerce, and banking, as well as figures on the fleets of the 
world and the world's shipping. It has been issued annually 
since 1864. The United States government has always shown 
considerable interest in statistics of foreign countries and has 
published them along with the domestic data; but this practice 
has been far more systematic and thorough since the First World 
War. For example, the Federal Reserve Bulletin regularly 
publishes statistics of prices, banking, and currency conditions 
in the principal nations of the world; foreign price statistics are 
published by the Bureau of Labor Statistics in its special bulletins; 
and statistics on trade between other countries, that is, the trade 
of the world outside the United States and not with the United 
States, are published by the Department of Commerce in Vol. 2 of 
the Commerce Yearbook (as well as the statistics of our own foreign 
trade). In 1938 the Paris International Chamber of Commerce 
published a brochure on the economic statistics in 26 countries. 

In addition to such collections of statistics for all or a majority 
of the countries of the world, mention should be made of the 
sources, in greater detail than the world volumes, for statistics 
concerning three of the important countries of Europe. For 
England and the Dominions, there is the Statistical Abstract for 
the British Empire, published by the Board of Trade. This 
combines what was previously published in the Statistical 
Abstract for the United Kingdom (first issued in 1864 for the years 
1840-1853) and the Statistical Abstract for the Several British Over¬ 
sea Dominions and Protectorates (first issued in 1864 for the years 
1850-1863). The French government publishes Annuaire statistique 
(1878) and the Bulletin de la statistique générale (1911). In 
Germany the official source of statistics is the Statistisches Jahrbuch 
für das deutsche Reich (1880). 

It has long been recognized that international statistics would 
be extremely important in obtaining true international, political, 
and economic understanding and cooperation. Consequently, 
for many decades, efforts have been made to arrive at some sort 
of international understanding on methods to make the com¬ 
pilation of international statistics feasible or at least to improve 
existing world statistics. The statistics of each country are 
gathered according to the needs of that country; and since the 
problems in respective countries differ, so do the statistics. 
Their compilation and classification, according to varying 
definitions of units and varying bases of classification, produce 
startling differences in the final results. Then, too, the economic 
organizations of the various countries are different. A country 
with a large amount of transit trade and heavy reexportation 
of goods imported needs a different sort of classification of foreign 
trade statistics than a country doing little reexport business. 
Furthermore, the statistics themselves are gathered and organized 
in diverse ways in the various countries; the methods of collecting 
the statistical raw materials, the periods for which these data are 
gathered, and the methods of classification are not the same in 
the various countries. 

The endeavors made in the last eighty years for better inter¬ 
national statistical information, therefore, were first concentrated 
on the problem of rendering national statistics more comparable, 
since national statistics must be comparable between the various 
nations before they can be added up or compared to obtain 
international or world statistics. Quételet, the Belgian who did 
so much to organize comparable international astronomical 
observations, was likewise the first to try to solve the problem 
of obtaining the fundamental basis for better world and inter¬ 
national statistics. It is principally due to him that the First 
International Statistical Congress was organized in 1853 in 
Brussels. The main purpose of this Congress, the members of 
which attended in their private and not in their official capacity 
(although some were officials), was to bring about some degree 
of comparability in national statistics between the various 
nations. 

Another attempt to obtain international cooperation in 
statistical work was made in 1887 when the International Statis¬ 
tical Institute was formed. This organization, still in existence, 
elects members who are active in statistical work as professors, 
government officials, or members of private statistical offices. 
The Institute cannot bind its members or the national governments 
of its members but makes progress by suggesting improvements 
to different countries. 

The first official or semiofficial attempts for better world 
statistics were made in 1875 through the establishment of the 
International Bureau of the Universal Postal Union and the 
Bureau of the International Telecommunication Union (origi¬ 
nally called the International Bureau of the Telegraph Union). 
Both regularly gather statistics on postal and telegraphic develop¬ 
ments. Similar efforts in another field were made for the first 
time in 1882, by the International Congress for Hygiene and 
Demography. In 1905, another significant official attempt was 
made for greater comparability in world statistics. In that year, 
at the suggestion of the United States government, a meeting 
was held in Rome to formulate some plan for obtaining uniform¬ 
ity of agricultural statistics. This meeting led to the founding 
of the International Agricultural Institute, which still is active 
in the gathering of world statistics on agriculture, production, 
consumption, prices, and trade. The statistical information 
assembled by this body is published monthly and yearly and 
special publications are also issued. Sixty-two different coun¬ 
tries are members of the Institute. The Institute was very 
successful in putting national agricultural statistics on an 
internationally more comparable basis and in assembling regu¬ 
larly good and reliable world statistics on all fields of agriculture. 

Since the First World War, the League of Nations has been 
the natural organization to proceed with the work of interna¬ 
tionalizing statistics. Shortly after its establishment, the League 
started that work. At the International Economic Conference 
of 1927 the problem of comparable national statistics in order to 
secure good world statistics was studied. The League of Nations 
subsequently brought about an official meeting on the subject 
of international statistics and called an International Statistical 
Conference to meet in Geneva in November, 1928. The keynote 
of the Conference was that the general adoption of comparable 
international statistics was desirable for good international 
policies and in the interests of permanent world peace. The aim 
of the Conference was to bring about the broadening of the scope 
of national statistics in all countries where it seemed to be needed 
and to attempt to make national statistics in different countries 
comparable. The Conference emphasized once more that such 
attempts meet with many difficulties. Of the 42 countries repre¬ 
sented (some nonmembers of the League, like the United States, 
were also represented), only 29 countries felt they could sign the 
Convention and Protocol of the Conference. To induce that 
number to sign, it was necessary to limit greatly the program of 
work. 

Nevertheless, the Conference of 1928 did produce good results. 
A number of points were discussed, and important conclusions 
were reached. In addition, the Conference created a committee 
of technical experts to meet from time to time and make sugges¬ 
tions for further progress. This group met in March, 1931, and 
formulated a constitution for future work. It met again in 
December, 1933, to discuss problems of statistics on foreign 
trade. Up to the present time, its contribution to the solution 
of the problems involved has been inconsiderable, but it may 
make advances in this important work if the countries concerned 
will be willing to carry out the recommendations made by it, as 
they are apparently committed to do by the Convention and 
Protocol of the Conference of 1928. 

In 1936 the twenty-third session of the International Institute 
of Statistics was held at Athens. At that session there were 
75 members, of which 10 were from North America. Twenty- 
seven countries designated official delegates. Also, the Secretary 
of the League of Nations, the International Labor Office, the 
International Institute of Intellectual Cooperation, the Inter¬ 
national Institute of Agriculture, and the International Chamber 
of Commerce were represented.¹ 

¹ Stuart, Prof. C. A. Verijn, “La XXIIIème session de l’Institut 
international de statistique, Athènes, 1936,” Revue de l’Institut international 
de statistique, vol. 4 (1936), pp. 367-403. The citation includes the summary 
of resolutions of the session (pp. 378-395) and communications from various 
delegations on methods, legislation, organization, and administration of 
statistics (pp. 396-403). 

In May, 1940, one of the 11 sections of the Eighth American 
Scientific Congress convened by the government of the United 
States in connection with the observance of the fiftieth anniversary 
of the founding of the Pan American Union was devoted to 
statistics. The program of the section had the following broad 
objectives: (1) improvements in the comparability of official 
statistics among the American nations; (2) improvements in 
statistical methodology; (3) the furtherance of acquaintance 
among the statisticians of the American continent; (4) con¬ 
sideration by these statisticians of the possible development of a 
continuing professional medium for the interchange of statistical 
ideas and information. Correspondents in several of the 
American nations had pointed to the need for closer profes¬ 
sional collaboration among the statisticians of this hemisphere, 
and it was proposed to explore at this meeting the possibilities 
of establishing some kind of an inter-American statistical organi¬ 
zation of professional character. The result was the formation 
of the Inter-American Statistical Institute. 

A new quarterly, the Estadística, published in Mexico, is the 
official organ of the Inter-American Statistical Institute, constituting 
one of its mediums for fostering statistical development 
in the Western Hemisphere. It endeavors to acquaint the 
persons in one country with statistical developments in other 
countries, to inform its readers concerning the availability of 
data, to present articles that will tend to encourage the adoption 
of improved methods, and hence to improve the quality of data. 
Articles may appear in any of the following four languages: 
Spanish, English, Portuguese, or French. An author’s sum¬ 
mary accompanies each article; the summary is reproduced in 
several languages. The Inter-American Statistical Institute 
also publishes a yearbook of statistics including statistical data 
for Latin-American countries and North America. 

Prospects for securing comparable world and international 
statistics fluctuate with the rise and fall of isolationism 
and nationalism. Under the League of Nations and under the 
Pan American Union progress has been encouraged, only to be 
hampered by ever-persistent isolationism in one country or 
another. Nevertheless, the need for comparable data with 
respect to all nations of the world has become more and more 
evident; it has come to be more and more appreciated as the 
problems have been studied by these various institutes, conferences, 
and committees; and more and more is it coming to 
be realized that such statistics are a pressing necessity to businessmen 
with interests spread far and wide over the international 
field. 

While it has been stressed in this section that there are as yet 
no truly comparable international statistics, the student of 
international affairs and the international businessman will be 
able to obtain what constitutes for the present the closest 
approximation to them from a number of sources, chief among 
them the following: (1) International Statistical Yearbook (pub¬ 
lished by the League of Nations); (2) Vol. 2 of the Commerce 
Yearbook (published by the United States Department of 
Commerce); (3) The International Appendix to the Statistics 
Yearbook of Germany (Statistisches Jahrbuch für das deutsche 
Reich); (4) the Statesman's Yearbook. The World Peace Foundation 
also publishes a subject index to the economic and financial 
documents of the League of Nations. 



CHAPTER IV 


PRESENTATION OF STATISTICS 
TABLES 

Principles of Tabulation. Tabulation is the mechanical part 
of classification. Its function is so to arrange the physical pres¬ 
entation of quantitative facts that there can be no misinter¬ 
pretation of their significance. The attainment of this object 
depends upon the following principles: 

1. Concise, clear, and complete titles attached to the table. 
Usually the title is placed at the top, above the table, but it is 
sometimes placed at the bottom. The function of the title is to 
give a general description of the contents of the table. 

2. Careful, unambiguous description of the units of measure¬ 
ment or presentation used in the collection and recording of the 
data. This is ordinarily placed immediately under the title. 
Subheadings frequently require definition of units. 

3. The arrangement of the data in columns and rows accord¬ 
ing to a clearly indicated basis for classification. 

4. The exact description of columns and rows by the use of 
caption headings and stub headings. 

5. Footnotes to clarify headings or subtitles or to specify 
limitations of particular figures. 

The scheme shown on page 93 gives an abstraction of the 
mechanics of tabulation. It shows the position of the title and 
the description of units above the table and for illustration 
designates four columns, numbered (1), (2), (3), and (4), and 
three rows, lettered (x), (y), and (z). 

The four columns are subcolumns: (1) and (2) are subcolumns 
of column (a), and (3) and (4) are subcolumns of column (b). 
The caption headings would appear in the spaces designated 
(a) and (b), respectively; and subcaption headings would appear 
in the spaces designated (1), (2), (3), and (4). Similarly, the 
three rows are described by stub headings appearing in (x), (y), 
and (z). The space (D) is for the general description of the stub 
headings. It is possible also to have stub subheadings. In 
order to illustrate further, there is reproduced in Table 1 on 
page 94 data compiled from the replies to the questionnaire 
shown on pages 46-47. 


                    Title
            (Description of units)

 (D)   |       (a)       |       (b)
       |  (1)   |  (2)   |  (3)   |  (4)
-------+--------+--------+--------+--------
 (x)   |        |        |        |
 (y)   |        |        |        |
 (z)   |        |        |        |
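
The scheme above can be sketched in a few lines of Python. This is only an illustration of the five principles, not a method taken from the text: the function name render_table and all of its parameters are assumptions introduced here, and the cell values are placeholders.

```python
def render_table(title, units, stub_head, captions, stubs, cells):
    """Lay out a table with a title, a units line, caption headings that
    span their subcaptions, stub headings, and a body of cells.

    captions is a list of (caption, [subcaptions]) pairs; all names here
    are hypothetical and chosen only for this sketch.
    """
    subs = [sub for _, group in captions for sub in group]
    stub_w = max([len(stub_head)] + [len(s) for s in stubs])
    entries = subs + [str(v) for row in cells for v in row]
    col_w = max(len(e) for e in entries)
    width = stub_w + len(subs) * (col_w + 3)

    def line(stub, values):
        # One printed row: stub heading first, then right-aligned cells.
        return stub.ljust(stub_w) + "".join(
            " | " + str(v).rjust(col_w) for v in values)

    # Each caption heading is centered over the subcolumns it covers.
    caption_row = stub_head.ljust(stub_w)
    for name, group in captions:
        span = len(group) * (col_w + 3) - 3
        caption_row += " | " + name.center(span)

    rule = "-" * width
    lines = [title.center(width), ("(" + units + ")").center(width), rule,
             caption_row, line("", subs), rule]
    lines += [line(stub, row) for stub, row in zip(stubs, cells)]
    return "\n".join(lines)


table = render_table(
    "Title", "Description of units", "(D)",
    [("(a)", ["(1)", "(2)"]), ("(b)", ["(3)", "(4)"])],
    ["(x)", "(y)", "(z)"],
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
)
print(table)
```

The title and the description of units are placed above the body, and the space (D) heads the stub column, as in the scheme. Footnotes, the fifth principle, would simply be printed below the last row.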


General-purpose and Special-purpose Tables. A mere glance 
at the specimen taken from the publication of the United States 
Department of Agriculture is sufficient to lead to the conviction 
that such tables are not meant for light reading. They are 
essentially reference tables, or general-purpose tables. The principal 
guide in the construction of general-purpose tables is to 
include as much as possible in as small a space as possible, consistent 
with presentation of the amount of information deemed 
necessary. Thus the tables contained in such publications as 
the United States Census reports or the Federal Reserve Bulletin 
or the Survey of Current Business may not constitute popular 
reading; but they are a great boon to all who seek ready access 
to details, arranged in a manner so facilitating their discovery 
by the careful observer that looking up a particular figure is 
almost as easy as looking up a word in the dictionary. 

Table 1. [Table body illegible in this copy. Legible column 
headings: Any source (Number); Earnings (Number); Other sources 
(Number).] 

Source: Bureau of Home Economics, United States Department of 
Agriculture, “Consumers’ Purchases Study,” Family Income and 
Expenditure, Pacific Region, Part I, Family Income, Miscellaneous 
Publication 339, p. 22. 

When a table is to be read, that is, when it is to tell a story, it is 
called a special-purpose table. Such a table should have as its 
outstanding characteristic the quality of simplicity. It should not 
try to tell too much at once; if necessary, more than one table 
may be used for telling a more complex story. Special-purpose 
tables should have a great deal of white space in and around 
them to make lazy readers (and most people are lazy when it 
comes to reading tables of figures) think them easy to read. 
The type or print should be sufficiently large for easy reading. 
The reader should be adequately prepared or oriented to the 
table by the text accompanying it and particularly by the title 
of the table. Briefly, the story of the table should be told in 
literary form in the text, reliance being placed on the table 
merely as a dramatic summary. Simple devices to aid interpretation 
and facilitate the mental vision of the table have a 
useful place in special-purpose tables, such as accompanying 
relative figures, methods of emphasis such as italics, or the 
scheme of ruling the table. 

Table 2.—Average Disbursements of Consumer Units¹ in Each Third 
of Nation, 1935-1936 
(Lower third, incomes under $780; middle third, incomes of 
$780-$1,450; upper third, incomes of $1,450 and over) 

                                  Average disbursements of    Percentage of income
                                  families and single
                                  individuals in
Category of disbursement             Lower  Middle   Upper   Lower  Middle   Upper
                                     third   third   third   third   third   third

Current consumption:
  Food                                $236    $404    $642    50.2    37.5    21.7
  Housing                              115     199     408    24.4    18.5    13.8
  Household operation                   54     108     240    11.4    10.0     8.1
  Clothing                              47     102     251    10.0     9.5     8.5
  Automobile                            16      57     215     3.3     5.3     7.2
  Medical care                          20      41     106     4.3     3.9     3.6
  Recreation                             9      28      89     1.8     2.6     3.0
  Furnishings                            9      28      72     1.8     2.6     2.4
  Personal care                         12      22      44     2.5     2.1     1.5
  Tobacco                               10      23      40     2.2     2.1     1.4
  Transportation other than auto        11      19      37     2.4     1.7     1.3
  Reading                                6      12      23     1.3     1.2     0.8
  Education                              2       7      30     0.5     0.6     1.0
  Other items                            3       6      15     0.6     0.5     0.5
All consumption items                 $550  $1,056  $2,212   116.7    98.1    74.8
Gifts and personal taxes²              $13     $39    $181     2.8     3.7     6.1
Savings                                -92     -19     566   -19.5    -1.8    19.1
All items                             $471  $1,076  $2,959   100.0   100.0   100.0

¹ Includes all families and single individuals, but excludes residents in institutional groups. 
² Taxes shown here include only personal income taxes, poll taxes, and certain personal 
property taxes. 

Source: National Resources Committee, Consumer Expenditures in the United States, 
Estimates for 1935-36 (1939), p. 40. 

The object of a special-purpose table may also be to compress 
into a small space a body of information “the narration of which 
in the text would be cumbersome and exhausting to the reader. 


It is, in short, a method of condensation, and it is of the utmost 
importance that, as it tells so much in so small a compass, it 
tell it as clearly as practicable.”¹ 

Table 3.—Share of Each Third of Nation’s Consumer Units¹ in 
Aggregate Disbursements, 1935-1936 
(Lower third, incomes under $780; middle third, incomes of 
$780-$1,450; upper third, incomes of $1,450 and over) 

                                  Aggregate disbursements,    Percentage of aggregate
                                  millions                    disbursement for each
                                                              category made by
Category of disbursement              Lower   Middle    Upper   Lower  Middle   Upper
                                      third    third    third   third   third   third

Current consumption:
  Food                               $3,108   $5,310   $8,447    18.4    31.5    50.1
  Housing                             1,515    2,621    5,370    15.9    27.6    56.5
  Household operation                   703    1,422    3,160    13.3    26.9    59.8
  Clothing                              618    1,338    3,305    11.7    25.5    62.8
  Automobile                            203      755    2,823     5.4    20.0    74.6
  Medical care                          264      546    1,395    12.0    24.7    63.3
  Recreation                            115      362    1,166     7.0    22.0    71.0
  Furnishings                           112      368      942     7.9    25.9    66.2
  Personal care                         155      292      585    15.1    28.2    56.7
  Tobacco                               134      301      531    13.8    31.2    55.0
  Transportation other than auto        150      247      487    17.0    27.9    55.1
  Reading                                84      165      302    15.3    29.9    54.8
  Education                              30       87      389     5.9    17.2    76.9
  Other items                            35       76      196    11.4    24.6    64.0
All consumption items                $7,226  $13,890  $29,098    14.4    27.7    57.9
Gifts and personal taxes²              $171     $516   $2,380     5.6    16.8    77.6
Savings                              -1,207     -252    7,437   -20.2    -4.2   124.4
All items                            $6,190  $14,154  $38,915    10.4    23.9    65.7

¹ Includes all families and single individuals, but excludes residents in institutional groups. 
² Taxes shown here include only personal income taxes, poll taxes, and certain personal 
property taxes. 

Source: National Resources Committee, Consumer Expenditures in the United States, 
Estimates for 1935-36 (1939), p. 51. 
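
The percentage columns of Table 3 divide each category's aggregate among the three thirds of the nation, so that every row adds to 100. A minimal Python sketch of that arithmetic follows; the function name shares is an assumption made for illustration, and the dollar figures are the Food and Savings rows of Table 3.

```python
def shares(amounts):
    """Percentage share of each group in the row total, to one decimal.

    The name and signature are hypothetical, chosen for this sketch.
    """
    total = sum(amounts)
    return [round(100 * a / total, 1) for a in amounts]


# Aggregate disbursements in millions of dollars for the lower, middle,
# and upper thirds, as printed in Table 3.
food = [3108, 5310, 8447]
savings = [-1207, -252, 7437]

print(shares(food))     # [18.4, 31.5, 50.1]
print(shares(savings))  # [-20.2, -4.2, 124.4]
```

The Savings row shows why a share can exceed 100 per cent: the lower and middle thirds spent more than their incomes, so the upper third accounts for more than the whole of net saving.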

¹ Falkner, Roland P., “Statistical Tabulation and Practice,” Journal 
of the American Statistical Association, vol. 11 (1916), pp. 192-200. 

Tables 2 to 6 are examples of special-purpose tables. They 
tell stories that are more or less hidden in the detailed but well-organized 
statistics collected by means of the questionnaire 
referred to above. In order to simplify the data for presentation, 
income levels are divided into three groups, lower third, middle 
third, and upper third. These tables illustrate also the use of 
percentage figures to facilitate their interpretation. 

Table 4.—Percentage Distributions of Nonrelief Families¹ in Six 
Types of Community, by Income Level, 1935-1936 
(City sizes are populations) 

                            Families living in
                            Urban communities                          Rural communities
                                                   Middle-
                               Metrop-     Large     sized     Small
                              olises,²   cities,   cities,   cities,
                         All 1,500,000  100,000-   25,000-    2,500-
Income level        families  and over 1,500,000   100,000    25,000  Nonfarm³      Farm

Under $250               2.8       1.7       2.0       2.4       3.1       3.0       3.8
$250-$500                7.8       2.8       4.4       5.5       6.3       8.9      13.9
$500-$750               11.3       5.2       7.6       9.4      10.3      11.8      18.0
$750-$1,000             13.4       8.5      10.5      13.6      13.9      14.4      16.6
$1,000-$1,250           13.2      10.9      12.4      13.9      14.6      14.0      12.8
$1,250-$1,500           10.8      11.0      10.6      11.6      11.1      11.6       9.8
$1,500-$1,750            9.1      10.8      10.0       9.7       9.4       9.1       7.0
$1,750-$2,000            7.3       9.7       9.0       8.5       7.8       6.5       4.8
$2,000-$2,250            5.5       7.9       6.9       6.1       5.8       5.1       3.1
$2,250-$2,500            4.0       5.8       5.5       4.5       4.0       3.4       2.5
$2,500-$3,000            5.2       8.5       7.1       5.4       5.3       4.4       2.9
$3,000-$3,500            3.0       4.7       4.2       3.1       3.1       2.3       1.6
$3,500-$4,000            1.8       2.9       2.7       1.7       1.7       1.3       1.0
$4,000-$4,500            1.0       1.7       1.6       1.0       0.8       0.8       0.5
$4,500-$5,000            0.6       0.9       0.9       0.7       0.5       0.6       0.3
$5,000-$7,500            1.3       2.1       1.8       1.3       1.1       1.4       0.6
$7,500-$10,000           0.8       1.6       1.1       0.6       0.6       0.6       0.4
$10,000 and over         1.1       3.3       1.7       1.0       0.6       0.8       0.4
All levels             100.0     100.0     100.0     100.0     100.0     100.0     100.0

¹ Excludes all families receiving any direct or work relief (however little) at any time 
during year. 
² Metropolises of this size are in North Central Region only (New York, Chicago, Philadelphia, 
and Detroit). 
³ Includes families living in communities with population under 2,500, and families living 
in the open country but not on farms. 

Source: National Resources Committee, Consumer Incomes in the United States, Their 
Distribution in 1935-36 (1938), pp. 24-25. 

CHARTS 

Quick visualization of many rather complex situations can 
be readily achieved by merely looking at a simple chart. It is 
said that nowadays the first step toward using a series of data 
for any sort of analysis is to represent the figures by a line drawn 
on a chart. So useful is the chart in giving a quick grasp of the 
characteristics of data that it has been adopted in many popular 
books, in magazines, and in the financial section of metropolitan 
newspapers. Figures 11 and 12 illustrate dramatically the 
manner in which charts are used to aid in visualizing important 
developments during wartime. In peacetime the trends of 
data, even though less sensational, are watched with care, and 
charts greatly facilitate their analysis. 

Fig. 11.—Federal expenditures for war activities, in billions of dollars. (From data 
published in Daily Statement of the United States Treasury Department.) 

The invention in 1786 of charting is claimed by William Playfair, 
who set forth its advantages as follows:¹ “As the eye is the 
best judge of proportion, being able to estimate it with more 
quickness and accuracy than any other of our organs, it follows, 
that wherever relative quantities are in question, a gradual 
increase or decrease of any . . . value is to be stated, this mode 
of representing it is peculiarly applicable; it gives a simple, 
accurate, and permanent idea, by giving form and shape to a 
number of separate ideas, which are otherwise abstract and 
unconnected.” 

¹ The Commercial and Political Atlas (3d ed., London, 1801), p. x. Playfair's 
claim to be “actually the first who applied the principles of geometry 
to matters of Finance” is made on pages viii and ix. Cited from W. C. 
Mitchell, Business Cycles: The Problem and Its Setting, p. 209. In An 
Enquiry into the Decline and Fall of Nations Playfair is said to have been the 
first to employ graphical devices in the treatment of sociological discussion. 



Fig. 12.—Production of munitions, including ships, planes, tanks, guns, ammunition, 
and all field equipment, by months, July, 1940, to December, 1943. (Data from War 
Production Board.) 

While the idea underlying the use of charts is quite old, the 
general use of charts for wide public consumption is of much 
more recent origin and probably owes its present-day popularity 
to inventions having to do with the plating of charts for printing. 
From being largely a hand-labor process, the making of plates 
for the reproduction of charts has come in recent years to be a 
photoelectric process, with the result that today the most 
expensive part of the charts in a book, newspaper, or magazine 
article consists in the mental and hand labor involved in the 
original construction of the chart. 

There are five kinds of charts: (1) pictograms, (2) cartograms, 
(3) frequency curves, (4) bivariate charts, and (5) curves pictur¬ 
ing time series. 


“William Playfair was, one may say, the Sir William Petty of the Edinburgh 
group . . .” Lancelot T. Hogben, Dangerous Thoughts (1939), p. 283. 

“milk dollar.” The small obscured piece at the top represents 
2.98 cents of profit for the New York City distributors. 

Areal and cubic comparisons are not frequently used because, 
instead of simplifying the comparison desired, they are likely 
to confuse it. This is because the mind finds difficulty in 
quickly differentiating sizes of areas or of cubes. Figure 14 
shows two areas in the form of squares. One of these areas is 
actually one-half as large as the other; but, at first glance, 
it seems to be more than half as large. Consequently, if comparison 
of two quantities is desired by charting, areal presentation 
is not a desirable method of obtaining easy comprehension 
of the differences that it is desired to stress. 

The difficulty is increased if the attempt is made to chart 
differences of magnitude by the use of cubes, for it is still more 
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difficult for the eye and mind to grasp geometric comparative 
magnitudes in three dimensions. This is shown in Fig. 15, 
which depicts two cubes, one of which is one-half as large as the 
other, though a first glance makes it appear to be two-thirds 
as large. For this reason the use of pictures for making com¬ 
parisons is not considered to be the best practice. For example, 
the presentation for quick visualization of different-sized men 
in uniform to represent the relative fighting strength of various 
countries or of different-sized battleships to represent the relative 
size of navies will confuse the interpretation that the eye and
mind will give to the relative sizes compared, even though the
relative size is given purely a linear setting in the actual drawing
of the figures.¹ Only the height of the uniformed men may be
varied, but this might lead to comically proportioned men and 
an illusion of armies of tall thin men vs. armies of short fat men. 
If the uniformed men are properly proportioned for their varying 
heights, this results in an areal comparison. 
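
The arithmetic behind Figs. 14 and 15 can be checked directly. The following is a minimal sketch in Python; the function name `linear_ratio` is introduced here for illustration only and is not part of the original discussion. It computes how the sides of two similar figures compare when their areas or volumes stand in a given ratio.

```python
def linear_ratio(size_ratio: float, dimensions: int) -> float:
    """Ratio of side lengths of two similar figures whose areas
    (dimensions=2) or volumes (dimensions=3) stand in size_ratio."""
    if size_ratio <= 0 or dimensions < 1:
        raise ValueError("size_ratio must be positive and dimensions >= 1")
    return size_ratio ** (1.0 / dimensions)


# A square with one-half the area, and a cube with one-half the volume.
square_side = linear_ratio(0.5, 2)
cube_edge = linear_ratio(0.5, 3)

print(f"side of half-area square: {square_side:.3f} of the larger side")
print(f"edge of half-volume cube: {cube_edge:.3f} of the larger edge")
```

The smaller square is about 71 per cent as wide as the larger, and the smaller cube about 79 per cent as wide. Since the eye tends to judge by width and height, this is consistent with the smaller figure appearing to be more than half as large.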

Consequently, the most generally used types of pictogram 
are those involving merely linear comparisons and the use of 
purely abstract linear distances. Rows of soldiers, each soldier 
representing a specified number of men, may be used to advan¬ 
tage, however, the longer row representing the larger army. 
Similarly, large and small navies can properly be compared by 
rows of ships, each ship representing a specified tonnage of that 
type of warship. Such pictograms are really linear comparisons 
as also are bar charts and sectors of circles. 

Bar Charts and Sectors of Circles. The use of bar charts
and sectors of circles is widely practiced and finds its application 
whenever it is desired to compare two or more differing mag¬ 
nitudes with each other or to give quick visualization of com¬ 
ponent parts of a given magnitude. Extensive use of vertical 
or horizontal bars is made by the United States Bureau of the 
Census in the Statistical Atlas of the United States, one of which 
was issued in 1914 and another in 1924. In addition, many 
modern writings, especially in the fields of the social sciences, 
attempt to portray by charts the statistics it is desired to present 
for popular reading. 

¹ Cf. Croxton, F. E., and Harold Stein, “Graphic Comparisons by Bars,
Squares, Circles and Cubes,” Journal of the American Statistical Association,
Vol. 27 (1932), pp. 54-60.
Figure 16 is a graphic portrayal of the budget expenditures
of the Federal government, based upon legislation in effect in
February, 1943, in which the blacked-out portion of the vertical
bars reveals in a striking manner the expected increases from
year to year in expenditures for war activities.

Fig. 16.—Budget expenditures of the Federal government, based upon legislation
as of February, 1943. Vertical scale: billions of dollars; horizontal scale:
fiscal years 1942, 1943, and 1944. Chart notes: (1) transactions in checking
accounts; (2) includes statutory public debt retirement. (The Budget of the
United States Government.)

The use of horizontal bars is illustrated in Fig. 17, which
shows graphically the statistical data in Table 4. The difference
between distribution of income among nonrelief families in
metropolitan areas as compared with that among families on
farms is seen at a glance, and a slight scrutiny of the bars brings
out the less dramatic but clear differences in the distribution of
income in small cities compared with that in the larger ones.

Another government publication contains data, shown in
Table 6, from which charts were drawn that illustrate the use
of the component-part bar chart. The data that appear in
these tables are shown in component bars in Fig. 18, where
the length of the bar is varied in accordance with the income
Fig. 17.—Income distributions of nonrelief families in six types of community,
1935-1936. (Based on Table 4.)


level. This makes possible the visual comparison of the average 
total family expenditure at various income levels. For example,
at the income level of $2,000 to $2,500 the aggregate family
expenditure averages a little over $2,000. At the same time,
the amount spent for various purposes can be seen from the
differently crosshatched parts of each bar. Throughout the 
bars one kind of crosshatching represents a specified kind of 

Fig. 18.—Use of income by American families at different income levels,
illustrating the use of bar diagrams. Bars are shown for each income level from
under $500 to $15,000-$20,000 and for the average of all levels; crosshatching
distinguishes food, housing, clothing, automobile, savings, and other uses of
income. Note: Taxes shown here include only personal income taxes, poll taxes,
and certain personal property taxes. (Based on Table 6.)

Fig. 19.—Percentage use of income by American families at different income
levels, 1935-1936, illustrating the use of 100 per cent bar diagrams. Bars are
shown for each income level from under $500 to $15,000-$20,000 and for the
average of all levels. Horizontal scale: per cent of income, 0 to 160, with
negative savings indicated. Note: Taxes shown here include only personal income
taxes, poll taxes, and certain personal property taxes. (Based on Table 6.)

expenditure. The second desirable comparison is still more
quickly grasped by the use of 100 per cent component-part bars,
which is illustrated in Fig. 19. When such a chart is drawn,
it is always advisable to warn readers that 100 per cent bar
charts are being used; in addition, the table of actual figures
should be given, for the actual figures are completely concealed
in the relative figures if only the chart is given. It will be
noticed that clever arrangement of crosshatching, placing
contrasting types adjacent to each other, aids greatly in the reading
of the chart. Figures 20 and 21 are interesting uses of the bar


Fig. 20.—Variation in expenditures with income, illustrating the use of a
crosshatched zone diagram. Zones: food, housing, household operation, clothing,
automobile, other items, gifts and taxes, savings. Horizontal scale: size of
income, 50 to 80 billions of dollars. [National Resources Committee, Consumer
Expenditures in the United States, Estimates 1935-1936 (1939), pp. 165-166.]

chart, virtually in the form of zones, to show the distribution of 
the consumer food dollar on the assumption of four different 
total national income levels. The same data are shown in 
Fig. 21 in the form of a 100 per cent bar or zone chart. The use 
of the zone effect has the advantage of aiding the eye to make 
the principal indicated comparisons. 

There are many examples of the use of sectors of circles in
the Statistical Atlas of the United States, census of 1920, and a
number in the publications of the census of 1930. Figure 22 is an
example of a single circle divided into sectors representing 
component parts in the utilization of milk in the United States 
in 1929. As in the case of the component bar charts, so also 
in the case of sectors of a circle, it is possible to represent changes 


Fig. 21.—Variation in percentages of various expenditures with income,
illustrating the use of a 100 per cent crosshatched zone diagram. Zones: food,
housing, household operation, clothing, automobile, other items, gifts and
taxes, savings. Horizontal scale: size of income, 50 to 80 billions of dollars.
[National Resources Committee, Consumer Expenditures in the United States,
Estimates 1935-1936 (1939), pp. 165-166.]

from time to time in percentage components by the use of a 
series of circles. It is not advisable to use the sectors and 
circles as bars were used in Fig. 21, namely, to picture relative 
change and total change simultaneously. To do this with 
sectors and circles involves areal comparisons that are not 
grasped by the readers of the charts. In Fig. 23, which is 
presented to illustrate the use of sectors and circles, the attempt

Fig. 22.—Utilization of milk in the United States, 1929, illustrating the use
of sectors of circles. Based on value. (Fifteenth Census of the United States,
1930, Vol. 4, Agriculture.)

Perhaps it is sufficient to have the smaller 1938 circle call atten¬ 
tion to the fact and then assume that the reader will be led 
thereby to note the figures, which are shown in a separate table. 
But the figure shown in each sector of the circles is a component 
percentage and does not throw light on aggregate amount. 

For the purpose of showing graphically the component parts
of a total, the split-bar chart is a promising new device. Figure 
24 illustrates its use to show the distribution of the consumer 
food dollar. Comparison between consumer dollars of varying 
Fig. 23.—The United States’ long-term investments in foreign countries, end
of 1930 and end of 1938, illustrating use of circles of different sizes. Sectors
show per cent of total for Canada and Newfoundland, Europe, Latin America, and
the rest of the world. (Bureau of Foreign and Domestic Commerce, “The Balance
of International Payments of the United States in 1938,” p. 49.)



Fig. 24.—Distribution of consumer food dollar, 1935, illustrating use of a
split-bar chart. [National Resources Committee, The Structure of the American
Economy, Part I (1939), p. 68.]
purchasing power could be shown by differences in the over-all
lengths of the bars used.

Simple bar charts and sectors of circles, it will be noted, do 
not involve areal comparisons to depict component parts; the 
bars and sectors are areas, it is true, but it is not the areas that 
are compared. The comparisons are between the varying lengths 
of the bars, because the bars are of uniform width. Bars of
varying widths would complicate the comparisons and make
them areal. Moreover, sectors of circles do not involve areal 
comparisons, because the comparisons visualized are the arcs 
cut on the circle by the angles from the center of the circle. 
The visual comparison is therefore a linear one. Everyone is 
used to estimating the size of a piece of pie. 
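
The linear character of the sector comparison can be made concrete by a short calculation. The sketch below is illustrative only: the function names and the three component figures are assumptions chosen for the example, not data from the charts discussed. Each component's share of the total fixes its central angle, and the arc it cuts on the circle is proportional to that angle.

```python
import math


def sector_angles(components: dict) -> dict:
    """Central angle in degrees for each component of a total."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total must be positive")
    return {name: 360.0 * value / total for name, value in components.items()}


def arc_length(angle_degrees: float, radius: float) -> float:
    """Length of the arc cut on a circle of given radius by a central angle."""
    return 2.0 * math.pi * radius * angle_degrees / 360.0


# Hypothetical component parts of a total of 100.
parts = {"A": 50.0, "B": 30.0, "C": 20.0}
angles = sector_angles(parts)

for name, angle in angles.items():
    print(name, angle, round(arc_length(angle, radius=1.0), 4))
```

Because arc length is a fixed multiple of the angle, a component twice as large as another cuts an arc twice as long, which is the comparison the reader actually makes.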

Cartograms. As the name indicates, cartograms are maps. 
Generally, outline maps are used and various devices employed to 
picture varying characteristics of different parts of the country. 
All are familiar with the colored maps that show the mountainous 
sections in brown shaded off to the green of the lowlands, the 
light brown in between being the higher plateaus, but not moun¬ 
tains. The same principle is used in a variety of ways to present 
statistics regarding geographically classified characteristics 
of the country by the use of maps. These will be classified and 
briefly described and illustrated. 

Cartograms by Dots, or Points. Dots varying in size for
different quantities are used in the first class of cartogram of 
this type. Because of the necessity of making areal comparisons, 
that is, of using different-sized circles, this type of cartogram is 
not widely employed and is considered not a good method of 
presenting subjects in cartogram form. An example in Fig. 25, 
however, shows clearly the areas of geographical concentration in 
1935 of wage earners in manufacturing industries of the United 
States. An attempt is made to facilitate the areal comparisons 
involved in this cartogram by supplying a key, in the lower right 
corner of the map. 

In the second kind of cartogram of this type dots of uniform 
size are used, each dot indicating an aggregate specified. When 
dots of this sort are used, they can be counted to figure out the 
total. Sometimes a dot is quarter or half shaded to indicate a 
quarter or half the amount of a full dot. When thus used, the 
Fig. 26.—Wholesale trade in the United States, 1930, illustrating use of equal-sized dots.
Bulletin, February, 1940, p. 94.)
dots should be of sufficient relative size so that there will not be
too many of them. An example is shown in Fig. 26.

The chief difficulty in the use of this kind of cartogram is the
mechanical one of arriving at the proper magnitude to assign
to each dot of uniform size. If the magnitude assigned to each
dot is too large, it becomes difficult to show graphically the
small quantities relating to geographical locations where the
characteristic is scarce. On the other hand, if the magnitude
assigned to each dot is too small, this results in too great a
crowding of the dots in areas where the characteristic is very plentiful.
In Fig. 26, this is illustrated by the attempt to picture the
volume of wholesale trade of the state of New York, compared
with the rest of the country. The dots are so dense that it is
hardly possible to count their number. While the general
picture of relative density is quickly visualized from such a map,
this purpose can be better served by the use of the point-dot
map. Another objection to the map with dots of uniform size for
this particular purpose is that it may convey the impression
that the concentration of wholesale trade is over the whole
state of New York, whereas it is known to be concentrated in the
metropolitan area of New York City. This conception is
more clearly brought out by the device of using large
dots of varying size.

Eastern flow of freight traffic, December 13, 1933.

Fig. 32.—Distribution of rubber manufacturing in three leading states in 1937,
illustrating a dramatic use of a point-dot map. [Reproduced from Barker, P. W.,
and E. G. Holt, Rubber Industry of the United States, 1938-1939 (Bureau of Foreign
and Domestic Commerce), p. 20.]

The third type is the point-dot cartogram, in which each dot 
means a certain quantity, but the dots are so small that they 
cannot be conveniently counted. The significance lies in pre¬ 
senting the idea of relative density of dots. Figure 27 shows the 
concentration in the Southeast and the Northern Middle states 
of nonpar banks of the United States. 

Cartograms by Colors and Shades. Obviously, the same effect
can be produced by the use of colors and shades as by the use of 
dots, but the former are expensive to reproduce in print and 
therefore are not extensively employed. The Statistical Atlas 
of the eleventh and twelfth censuses of the United States 
contains numerous such cartograms. 

Cartograms by Crosshatching. Making comparisons relating to 
geographical location by crosshatching maps has increased in 
popularity during recent years. It is more effective than the 
method of dots and is cheaper than coloring and shading. 
Figure 28 makes it easy for the reader to visualize the variation 
in different parts of the United States in the proportion of 
mortgaged owner-operated farms paying rates of interest as 
high or higher than 6.5 per cent. Figure 29 shows at a glance 
the variation from state to state in the percentage increase in 
nonagricultural employment from 1940 to 1943. 

Figure 30 is an interesting experiment in the combined use 
of a map and bar chart to show variation in the percentage 
increase in manufacturing employment in various metropolitan 
areas from 1940 to 1943. Figure 31 shows the use of a map and 
bars to depict flow of freight traffic in the United States. In 
Fig. 32 the geographical concentration of the rubber-manu¬ 
facturing industry in three states of the United States is 
dramatically emphasized by showing outline maps of only those 
three states. 
STATISTICS—A STUDY OF VARIATION

Ubiquitous Variability. Only in the abstract sense is there
such a thing as a fixed quantity; in all cases, with reference both
to physical and to psychic things, practical quantitative
expressions are variables. However fixed the true quantity may be,
no human measuring device is capable of giving the exact
quantity; hence, all measurements obtained are approximations.
In both physical sciences and social sciences, the raw materials
amenable to the techniques of statistics are quantitatively
expressed variations. The methods of analysis are likely to be
complex when the scientist is faced with complex variability.
This fact for the social sciences is recognized in the following
quotation:¹ “The social scientist is limited by the fact that he
does not deal with rational material but with the rational and
irrational conduct of man. The host of variables which this
fact introduces multiplies the obstacles to his work and sets
limits to the applicability of results.”

USE OF SYMBOLS 

Simplification of the complex methods that need to be used in
statistics is accomplished by the use of symbols. Because
symbols are used for various purposes, beginners may have a natural
psychological reaction unfavorable to the study of statistics.
The uninitiated may be mystified and frightened away from the 
subject on account of the symbolic presentation. It is impor¬ 
tant, therefore, to realize that the symbols used in statistics are 
quite simple and that there are not very many of them. Fur¬ 
thermore, they are easily learned and remembered, as soon as 
their real purpose of simplification is understood. 

¹ Fosdick, Raymond B., A Review for 1939—The Rockefeller Foundation,
pp. 41-42. This foundation contributes extensively to the support of
research in many scientific fields; for example, it contributes to such research
organizations as the Brookings Institution and the National Bureau of
Economic Research, discussed in Chap. III.
The Variable X. The study of variation is the meat and
bones of the craft. The variable X is not a new idea to anyone
who has gone as far as a first course in algebra and who has on
many occasions said, “Let X equal . . . .” Symbols enter into
statistical analysis in only three ways:

1. To represent variation in size with time; in such a case the
data measuring the variable are designated “time series.”

2. To represent variation in order of magnitude, from smallest
to largest, or vice versa (if time is involved, it is disregarded, as
the variable is rearranged or reclassified upon the basis of
magnitude); in such a case the data measuring the variable are
designated “frequency series.”

3. To represent variation in quality or attribute (for example, 
occupation, geographical location, or race). 
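
The three ways of classifying may be sketched in a few lines of Python. The observations and occupations below are hypothetical, and the variable names are introduced only for this illustration.

```python
from collections import Counter

# Hypothetical monthly observations of one variable, in order of occurrence.
observations = [("Jan.", 3.0), ("Feb.", 2.0), ("Mar.", 4.3), ("Apr.", 5.0)]

# 1. Time series: the values kept in the order of time.
time_series = [value for _, value in observations]

# 2. Frequency series: the same values rearranged by magnitude,
#    the time of occurrence being disregarded.
frequency_series = sorted(time_series)

# 3. Classification by attribute: hypothetical records counted by occupation.
occupations = ["farmer", "clerk", "farmer", "teacher"]
attribute_counts = Counter(occupations)

print(time_series)
print(frequency_series)
print(dict(attribute_counts))
```

The same four figures appear in the first two arrangements; only the basis of classification differs.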

In symbolic language, it is purely a matter of convention that 
the variable may be referred to as X or as Y or as Z. In a given 
problem, if the nomenclature of X is assigned to a given variable, 
it is necessary to retain that symbol for that particular variable 
throughout the problem. In the theory of statistics conventions 
have arisen as to the use of symbols; for example, variables are 
commonly designated by the letters at the end of the alphabet, 
while constants or known figures are designated by the letters 
at the beginning of the alphabet. 

One convention widely followed is to use a bar over a letter
to designate the arithmetic mean, so that $\bar{X}_i$ (read “$X_i$ bar”) is
the symbol for the mean of a series of $X_i$’s. Another group of
X’s would be $X_j$ and their mean $\bar{X}_j$. The subscripts i and j,
respectively, symbolize subgroups. For example, all the X’s
may refer to the I.Q.’s of college freshmen; $X_i$ refers to the I.Q.’s
of male freshmen; and $X_j$ refers to the I.Q.’s of female freshmen.
Accordingly, $\bar{X}_i$ symbolizes the mean I.Q. of male freshmen, and
$\bar{X}_j$ symbolizes the mean I.Q. of female freshmen. It is then
conventional to designate the mean of all the X’s, both $X_i$ and
$X_j$, as $\bar{X}$ (called “X bar”).
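
A small numerical sketch may help fix the notation. The I.Q. figures below are hypothetical and the variable names are chosen only for this illustration. It also shows that the mean of all the X's is the mean of the pooled observations, which equals the subgroup means weighted by subgroup size and not, in general, their simple average.

```python
from statistics import mean

# Hypothetical I.Q.'s: x_i for male freshmen, x_j for female freshmen.
x_i = [100, 110, 120]
x_j = [120, 130]

mean_i = mean(x_i)            # X_i bar
mean_j = mean(x_j)            # X_j bar
grand_mean = mean(x_i + x_j)  # X bar, the mean of all the X's

# The grand mean equals the subgroup means weighted by subgroup size.
weighted = (len(x_i) * mean_i + len(x_j) * mean_j) / (len(x_i) + len(x_j))
simple_average_of_means = (mean_i + mean_j) / 2

print(mean_i, mean_j, grand_mean, weighted, simple_average_of_means)
```

Here the subgroup means are 110 and 125, the mean of all five observations is 116, and the unweighted average of the two subgroup means is 117.5.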

Another commonly used convention is to designate an
estimated figure by a letter followed by prime. According to this
convention if an estimate is made of the value of X (for example,
the coming crop yield of wheat based upon reports to the United
States Department of Agriculture), the estimate is symbolically
designated X'. Similarly, if an estimate of X (the price of
wheat, for example) is made from information on supply and
demand data, it is called X'. The small Greek letter sigma (σ)
is used to designate standard deviation.¹ A special estimate
of the standard deviation is symbolized by a variant of σ.

It is a common practice to use certain other Greek letters to
symbolize statistics. Accordingly, $\mu_1, \mu_2, \mu_3, \ldots, \mu_n$ (the Greek
letter mu) symbolize the series of statistics called “moments”
about a mean of a sample. The symbol $\pi$ refers to the constant
3.1416. The symbols $\nu_1, \nu_2, \ldots$ (the Greek letter nu) refer
to moments about an arbitrary origin.
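
The distinction between the two kinds of moments may be sketched as follows. The sample, the arbitrary origin of 3, and the function name `moment_about` are all assumptions made for illustration; the sketch also assumes the usual definition of the k-th moment as the average of the k-th powers of the deviations from the chosen origin.

```python
def moment_about(values, origin, order):
    """Average of the order-th powers of deviations from origin."""
    return sum((v - origin) ** order for v in values) / len(values)


sample = [2.0, 4.0, 4.0, 6.0]           # hypothetical sample
sample_mean = sum(sample) / len(sample)

# Moments about the mean (mu) and about an arbitrary origin (nu).
mu_1 = moment_about(sample, sample_mean, 1)
mu_2 = moment_about(sample, sample_mean, 2)
nu_1 = moment_about(sample, 3.0, 1)
nu_2 = moment_about(sample, 3.0, 2)

print(mu_1, mu_2, nu_1, nu_2)
# The second moment about the mean can be recovered from the nu's.
print(nu_2 - nu_1 ** 2)
```

For this sample the first moment about the mean is zero, and the second moment about the mean (2.0) equals the second moment about the arbitrary origin less the square of the first, which is the standard relation between the two series.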

While the use of symbols has become fairly well standardized
in some respects along the conventional lines indicated, complete 
uniformity and consistent systematization are far from realized. 
Even the simple conventions above enumerated are not uni¬ 
versally followed. Nevertheless, the student will find it an 
advantage to have his attention directed to these trends in 
symbolic representation. 


TIME SERIES 

Conventional Use of X and T to Symbolize Passage of Time. A
convention in time-series analysis is that X is used to refer to the
passage of time. T is also used for this purpose.² It happens
that the same symbol, X, is conventionally used in geometry,
trigonometry, and the like, to refer to the horizontal axis in a plane.
The unification of these two conventions results in the convention
in statistics that, in making graphs of statistical time series, the
X-axis (the horizontal axis) is used to represent the passage of
time. Thus, the passage of time may be indicated by a series of
X’s: $X_1, X_2, X_3, \ldots, X_n$, as shown in Fig. 33, where X refers
to years

1940    1941    1942    1943    . . .
$X_1$   $X_2$   $X_3$   $X_4$   . . .

Fig. 33.

or as shown in Fig. 34 where X refers to months.

Jan.    Feb.    Mar.    Apr.    . . .
$X_1$   $X_2$   $X_3$   $X_4$   . . .

Fig. 34.

¹ For further discussion of the standard deviation, see Chap. VI.

² See Chaps. XIX-XXIV.
As indicated, $T_1, T_2, \ldots, T_n$ may also represent the passage
of time.

Lower-case letters x and t refer to deviation from the mean;
that is, $X - \bar{X} = x$; $T - \bar{T} = t$.

Where the Variable Fluctuates in Size with Time. When the
statistician is dealing with a variable that fluctuates in size with
the passage of time, he refers to this variable as Y. This is a
convention; there is no logical reason for it except that he has
already used the symbol X or T to refer to time and wants to
have a different symbol for the variable being studied as it
fluctuates through time. This situation is described in technical
language by saying that the variable is a “function of time,” by
which is meant merely that, as time passes, the variable fluctuates
in magnitude, one way or another. The simple symbolic way of
saying exactly the same thing (where X refers to time and Y
refers to the variable) is

$Y = F(X)$

There is nothing mysterious to be read into this expression. It
is merely a use, slightly different from the ordinary one, of the
equality sign; and the whole expression means that Y is a
function of X, or the variable which is being studied is a function of
time, meaning that it fluctuates with the passage of time. This
may be illustrated by one or two examples, imaginary figures
being used.

Time Passes in 1944

The unit that constitutes the variable is the price of sugar per pound in
the New York City market (average for the month of prevailing daily prices).

        X                    Y
$X_1$   January      $Y_1$   3 cents
$X_2$   February     $Y_2$   2 cents
$X_3$   March        $Y_3$   4.3 cents
$X_4$   April        $Y_4$   5 cents
$X_5$   May          $Y_5$   4 cents
$X_6$   June         $Y_6$   2.8 cents


Thus $X_1$ is the first unit of time (January), and $Y_1$ is the
measurement of the variable Y at that time according to the
designated unit of description; in other words, $Y_1$ is the price in
January. Similarly, $Y_2$ is the price in February ($X_2$, or the
second unit of time), and so on.
The unit of time may be the week, as where the unit that constitutes the
variable is the amount of rainfall in inches in New York City per week.

X               Y
First week      0.1 inch
Second week     4.0 inches
Third week      0.3 inch
Fourth week     0.7 inch

In this illustration, $X_1$ refers to the first week, $X_2$ to the second
week, etc., while $Y_1$ refers to the inches of rainfall in the first
week, $Y_2$ to the inches of rainfall in the second week, etc.

The unit of time may be the year, as where the unit that constitutes the
variable is the net worth of a business enterprise on Jan. 1 of each year.

X       Y
1936    $20,001.00
1937    $28,546.00
1938    $21,527.00
1939    $20,250.00
1940    $27,430.00
1941    $35,240.00


It is customary in geometry, trigonometry, etc., to let the
vertical axis represent the Y variable; fluctuations in Y are shown
by vertical distances. The unification of this custom with
statistical presentation results in the convention that, when a
graph is made of a variable that is a function of time, fluctuations
in the Y variable are shown by vertical distances while time
change is indicated along the X-axis, or horizontally.

Figure 35, showing comparative changes in cash farm income, 
farm-mortgage debt, and value per acre of farm real estate for 
years 1910-1942, is an illustration of the graph of a time series. 

Careful Description of Units Involved. One or two matters
concerning the units involved in time series should be noted. 
Sometimes the variable refers to an average value over a specified 
period of time; in the first illustration above, the average price 
of sugar per pound in New York City is an average over a period 
of a month. In other instances, the variable refers to a total for 
a given period of time; in the second illustration above, the inches 
of rainfall are given by totals per week. In still other problems, 
the variable refers to a quantity at the beginning of a period of 
time or at the end of a period of time; in the third illustration, 
the net worth of a business enterprise on Jan. 1 of successive 
years was used. In Fig. 35, cash farm income is in totals for
calendar years, each year’s total being expressed as a percentage
of the average 1910-1914 yearly income. Farm-mortgage debt
is in amounts as of Jan. 1 each year, expressed as a percentage
of the average 1910-1914 annual amounts. Value per acre of
farm real estate is in amounts as of Mar. 1 each year, expressed
as percentages of the average 1912-1914 annual amounts.

It is important in connection with the study of time series to
know exactly how the variable is being used. It is equally
important that exact indication of this should be given. Every
good statistician invariably indicates either in titles of tables or
in footnotes just what his variables mean. He should do this
no matter how expert a statistician he is and no matter how clear,
without such explanation, his work may seem to him.

Fig. 35.—Cash farm income, farm-mortgage debt, and value per acre of farm real
estate, index numbers, United States, 1910-1942. (U.S. Department of
Agriculture, Bureau of Agricultural Economics.)

Cumulative and Noncumulative Data. Another important 
matter is the difference between cumulative and noncumulative 
data in time series. The fundamental distinction between 
cumulative and noncumulative data is really the difference 
between data of ‘‘condition” and data of “change.” Cumula¬ 
tive data are the data of change. It is possible to add the data 
on weekly rainfall and thus obtain data on monthly rainfall
or yearly rainfall. Sales of a store by the week can be added to
get sales by the month or by the year. It is possible to cumulate
the number of births daily in order to get the total number of
births per month or per year. Income and outgo figures are
cumulative data. To those who have studied accounting, a
convenient analogy is to the profit and loss statement—figures
in the profit and loss statement in the main represent cumulative
data.

Noncumulative data are those describing a condition and are
not subject to the additive treatment. The average price of
sugar per week cannot be cumulated to obtain the average price
of sugar per month or per year. It is necessary to resort to
averaging. The daily figures on population cannot be added
in order to get the monthly population figures. A balance
of $3,000 in the bank in January and of $5,000 in March do not
give you a balance of $8,000 for the two months. These are
items of condition and cannot be added. In order to obtain
significant summary results in the case of noncumulative data over
several periods of time, it is necessary to average rather than to add.

The method of averaging is applicable, not only to the
noncumulative, but to the cumulative type of data. It is significant
to speak of the average daily rainfall during a given month or
year, or the average weekly rainfall during a given month or
year, or the average weekly sales of a given year, etc.
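
The distinction may be sketched numerically. The weekly rainfall figures are those of the rainfall illustration given earlier; the weekly sugar prices and all variable names are hypothetical, introduced only for this sketch.

```python
# Cumulative data (data of change): weekly rainfall in inches.
weekly_rainfall = [0.1, 4.0, 0.3, 0.7]

# Adding is meaningful: the total for the four weeks.
rainfall_total = sum(weekly_rainfall)
# Averaging is also meaningful: average rainfall per week.
rainfall_average = rainfall_total / len(weekly_rainfall)

# Noncumulative data (data of condition): hypothetical average weekly
# prices of sugar in cents per pound.
weekly_prices = [3.0, 2.0, 4.0, 5.0]

# Only averaging is meaningful here; the sum of the four prices
# describes no price that prevailed at any time.
price_average = sum(weekly_prices) / len(weekly_prices)

print(round(rainfall_total, 2), round(rainfall_average, 3), price_average)
```

The total of 5.1 inches is a genuine four-week rainfall, whereas the corresponding total of the prices (14 cents) has no meaning; the average of 3.5 cents is the proper summary.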

Another way of referring to a time series is to describe it as the
situation in which a variable is classified according to the time
of its occurrence. The basis of classification is time; and the
most logical arrangement of the data in question is upon that basis.
As will be seen, the data of a time series may be reclassified, for
certain purposes, upon a different basis, and when this is done
they no longer constitute a time series.

Charting Time Series. When a time series is graphed, the
X-axis is used to represent passage of time, while the Y-axis is
used to represent varying magnitudes. Thus, in Fig. 36, the
points plotted would represent a magnitude equal to 2 in 1940,
equal to 1 in 1941, and equal to ½ in 1942, rising to 1½ in 1943.
It is conventional to represent time series by lines or curves
connecting the plotted points. In graphic phraseology these
lines may be drawn through the plotted points as polygons (e.g.,
Fig. 36), or the changes in direction may be curved.

Fig. 36.—Chart of a time series.

Two kinds of charts are in general use for the graphic presen¬ 
tation of time series: (1) arithmetic charts and (2) ratio charts. 

ARITHMETIC AND RATIO CHARTS 

Arithmetic Charts. The arithmetic chart pictures arithmetic
changes in magnitude. For illustration, in Fig. 37 is shown a
variable magnitude represented by the line AA′, increasing by 1
during each time interval. This produces a straight line. On
such a scale any variable increasing by a constant amount would give
a straight line; but any variable increasing at a constant relative
rate would produce an ever-steeper curve. This is illustrated by
BB′, which shows a magnitude doubling in each interval, that is,
increasing at a constant relative rate.

Fig. 37.—Constant growth and constant rate of growth.

Fig. 38.—Showing effect of omitting zero line.
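
The contrast between the two lines can be verified by computing successive differences and successive ratios. The sketch below is illustrative: the starting value of 1, the five intervals, and the helper names are assumptions, not readings from Fig. 37.

```python
def first_differences(series):
    """Arithmetic change from each value to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def successive_ratios(series):
    """Relative change (ratio) from each value to the next."""
    return [later / earlier for earlier, later in zip(series, series[1:])]


# Line A: increases by 1 in each interval (constant amount).
line_a = [1 + t for t in range(5)]
# Line B: doubles in each interval (constant relative rate).
line_b = [2 ** t for t in range(5)]

print(line_a, first_differences(line_a))
print(line_b, first_differences(line_b), successive_ratios(line_b))
```

The differences of the first series are all equal, so it plots as a straight line on an arithmetic scale. The differences of the second series grow from interval to interval (1, 2, 4, 8) while its ratios remain constant at 2, so it plots as an ever-steeper curve.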

The significant comparison in such a chart is always with zero, 
and hence the zero line should invariably be included in the chart. 
Leaving out the scale between zero and the point where the 
curve reaches its lowest point will give a deceptive appearance 
to the changes that occur. This is illustrated in Fig. 38, where 
p 2 is really 1| larger than Pi (see scale) but appears in the figure 
to be twice as great because only part of the vertical scale is 
shown. 
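
The deceptive effect of a shortened vertical scale can be put in figures. In the sketch below the two magnitudes and the point at which the scale begins are hypothetical, and the function name `apparent_ratio` is introduced only for illustration. It compares the true ratio of two magnitudes with the ratio of the heights at which they are plotted above the bottom of the chart.

```python
def apparent_ratio(p1: float, p2: float, axis_start: float) -> float:
    """Ratio of the plotted heights of p2 and p1 when the vertical
    scale begins at axis_start instead of at zero."""
    if axis_start >= min(p1, p2):
        raise ValueError("axis_start must lie below both magnitudes")
    return (p2 - axis_start) / (p1 - axis_start)


p1, p2 = 100.0, 120.0   # hypothetical magnitudes

true_ratio = apparent_ratio(p1, p2, axis_start=0.0)
shortened = apparent_ratio(p1, p2, axis_start=80.0)

print(true_ratio, shortened)
```

With the zero line shown, the second magnitude stands 1.2 times as high as the first; with the scale begun at 80, it stands twice as high, although the data are unchanged.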

An arithmetic chart may also be a graph of relative figures,
in which change from time to time relative to some base is
pictured. Such a graph is Fig. 35. In this kind of graph, the
base is usually called arbitrarily 100 per cent and the relative
changes above and below that base are graphed as percentages
of it. Figure 39 shows a magnitude at 105 in 1941 (5 per cent
above the base), at 95 in 1942, at 90 in 1943, and at 105 again in
1944. It is an extensive practice to convert time series into
relatives, using some particular point in time as the base; and
when such relative series or “indexes” (as they are sometimes
called) are charted, the chart assumes the form indicated in
Fig. 39. The point of departure for reading such a chart is the
100 per cent line, which should be emphasized; the zero point
does not have to be shown on such a chart. The relative chart
should not be confused with the case in which raw data are
already in the percentage form and the zero per cent may be
the significant point of departure rather than 100. Thus the
raw data may be percentages of population paying income taxes
in successive years. In such a case the zero line is important,
the raw data themselves being in percentage figures.

Fig. 39.—Chart of time series in relatives.

Figure 40 is an illustration of a graph of a relative time series.
It shows the month-to-month variation, compared with the
average of monthly figures, 1935-1939, in the prices of 354
industrial stocks; thus the average 1935-1939 equals 100 per cent.

Fig. 40.—The percentage changes in the prices of 354 industrial stocks
(1935-1939 = 100). (Survey of Current Business, Weekly Supplement, Apr. 29, 1943.)
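The conversion of a time series into relatives may be sketched as follows. The raw figures and the choice of 1940 as base are hypothetical, chosen only so that the resulting relatives agree with the readings described for Fig. 39; the function name is likewise an assumption.

```python
def to_relatives(series, base_year):
    """Express every value as a percentage of the base-year value."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in series.items()}


# Hypothetical raw data; only the resulting relatives correspond to Fig. 39.
raw = {1940: 200, 1941: 210, 1942: 190, 1943: 180, 1944: 210}
relatives = to_relatives(raw, base_year=1940)

for year, relative in relatives.items():
    print(year, relative)
```

Where the base is an average of several periods (as with the 1935-1939 base of Fig. 40), the same division applies with the period average in place of the single base-year value.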

Ratio Charts. The second type of chart for graphing time
series is the ratio chart, which is designed to picture relative rate
of change. According to Wesley C. Mitchell, the idea of the
ratio chart was introduced by Jevons in 1863-1865.¹ But the
ratio chart did not come into general use until its advantages
were explained by Prof. Irving Fisher and James A. Field, in
1917.²



The great popularity in recent years of the ratio chart has 
been largely due to the fact that special graphing paper has 
been made for the purpose, the work of making such a chart being 
thus vastly simplified. 

In the case of the arithmetic chart, equal rises on the chart
per unit of time represent a constant amount of increase; in the
case of the ratio chart, equal rises per unit of time represent
a constant relative rate of increase. This is illustrated by the
comparison of the left with the right scale in Fig. 41. This
figure is a simple illustration showing a magnitude changing
at the same relative rate, BB', and a magnitude changing by
a constant amount, AA', both plotted on a ratio scale. The BB'
magnitude doubles in each time period. The AA' magnitude
increases in each time period by the constant difference of 4.

¹ Mitchell, W. C., Business Cycles, p. 209.

² Fisher, Irving, “The ‘Ratio’ Chart for Plotting Statistics,” Publications
of the American Statistical Association, Vol. 15 (June, 1917), pp. 577-601;
Field, James A., “Some Advantages of the Logarithmic Scale,”
Journal of Political Economy, Vol. 25 (October, 1917), pp. 805-841. Cited
from Mitchell, op. cit., p. 209.

Notice the scale of logarithms at the right, which corresponds 
to the scale of natural numbers at the left. These logarithms 
are to the base 2. Thus, the log 2 of (>4 is 6 because 2® = 64; 
log 2 of 32 is 5 because 2^ = 32; etc. It is evident, of course, that 
while the scale at the left is in geometric progression the scale 
at the right is in arithmetical progression. This is a character¬ 
istic of ratio paper. Ratio charts have no zero line, and there 
is no point of emphasis. The attention is directed to the shape 
and fluctuations in the curve. In the case of the arithmetically 
ruled chart, growth at a constant difference is a straight line— 
the greater the difference, the steeper the line—but it is still a 
straight line if the difference is constant. In the case of the 
ratio chart, growth at a constant relative rate is a straight line— 
the greater the constant relative rate, the steeper the line—but 
it is still straight. 
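The straight-line property of the ratio chart can be verified by taking logarithms to the base 2, as on the right-hand scale of Fig. 41. This is a sketch only: the doubling series and the series rising by 4 stand in for BB' and AA', and their starting values are assumed rather than read from the figure.

```python
import math

bb = [1, 2, 4, 8, 16, 32, 64]     # doubles each period, like BB'
aa = [4, 8, 12, 16, 20, 24, 28]   # rises by 4 each period, like AA' (start assumed)


def log2_steps(series):
    """Vertical rise per period when the series is plotted on a base-2 ratio scale."""
    logs = [math.log2(v) for v in series]
    return [b - a for a, b in zip(logs, logs[1:])]


bb_steps = log2_steps(bb)   # equal rises: a straight line on the ratio chart
aa_steps = log2_steps(aa)   # shrinking rises: a flattening curve on the ratio chart

print(bb_steps)
print([round(step, 3) for step in aa_steps])
```

Equal steps in the logarithm mean equal vertical rises on ratio paper; the constant-difference series shows ever smaller steps, so its line bends toward the horizontal.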

On arithmetical paper, changes in differences produce curves 
or irregular lines. On ratio paper, changes in relative rates of 
change produce curves or irregular lines. The vertical scale of 
the arithmetical chart is an arithmetic progression. The vertical 
scale of the ratio chart is in geometric progression; but the 
logarithms of the natural scale on a ratio chart are in arithmetical 
progression. For this reason, the ratio chart is often called the 
semilogarithmic chart. One method of plotting a ratio chart is to 
find the logarithms of the raw data and then plot the logarithms 
on arithmetically ruled paper. The results are the same as if 
the natural data were plotted on a ratio scale. The labor of 
looking up logarithms is avoided by having the scale made into 
a logarithmic one, upon which the plotting of natural data will 
produce the same effect as if the logarithms were found and 
plotted. This is shown in a very simple case in Fig. 41, in 
which the scale in logarithms is at the right and the scale in raw 
data units is at the left. As already explained above, the line
BB' represents a variable that increases at a constant relative
rate, while the line AA' represents a variable that increases by a
constant quantity. In Fig. 41 the vertical distance between each
pair of adjacent scale markings on the left represents just double the
absolute amount of the same vertical distance immediately
below it and just half the absolute amount of the same vertical
distance immediately above it. In this figure the variable that
doubles every year follows the straight line BB' (it was a curve
in Fig. 37). A variable that increases by the same aggregate
amount each year and hence follows a straight line in Fig. 37
would follow a curved path on a ratio chart, such as line AA' of
Fig. 41.

Since the logarithm of the ratio between two quantities is equal
to the difference between their logarithms, ratio paper can be
easily “calibrated” by the use of a logarithmic scale. Thus, if
equal vertical distances are taken to measure equal aggregate
differences between logarithms, then these same vertical distances
will represent equal relative distances (equal ratios) between the
antilogarithms of the logarithmic scale. In Fig. 41, for example,
the unit vertical distance is taken to be a unit difference between
logarithms to the base 2, and the logarithmic scale on the right
reads 0, 1, 2, 3, 4, etc. Since the antilogarithm of a number to
the base 2 is equal to 2 raised to the power of that number, the
antilogarithms of the logarithmic scale become 1, 2, 4, 8, 16, etc.
This is the scale shown on the left. It is evident that while the scale on
the right is in arithmetic progression the scale on the left is in
geometric progression. Accordingly, if paper is ruled so as to
be in arithmetic progression with respect to some logarithmic
scale but is marked or calibrated in terms of the antilogarithms
of the logarithmic scale, any variable plotted on this paper in
accordance with the antilogarithmic scaling will indicate a constant
rate of growth or decline wherever it traces out a straight
line.

Most ratio paper is ruled in accordance with a logarithmic 
scale to the base 10, since this is the base of common logarithms. 
An example of this kind of “semilogarithmic paper” (as it is 
often called because the vertical scale is logarithmic while the 
horizontal scale is arithmetic) is shown in Fig. 42. The reason
common logarithms are to the base 10 is that numbers are 
arranged upon a decimal system and, by taking the base 10 for 
logarithms, the integral part of the logarithm (characteristic) is a 
mere record of the position of the decimal point in the original 
number. The number 10 raised to the zero power is 1, and so the 
logarithm of 1 is zero; the number 10 raised to the second power 
is 100, and so the logarithm of 100 is 2; the number 10 raised to 
the third power is 1,000, and so the logarithm of 1,000 is 3; and
so on, indefinitely. Likewise any number between 1 and 10 will
have a logarithm (to the base 10) whose characteristic is 0; any
number between 10 and 100 will have a logarithm whose characteristic
is 1; etc. The fractional part of a logarithm (its
mantissa) is the same for all similar successions of similar digits. 
The fractional part of the logarithm to the base 10 for the number 
2 is the same as the fractional part of the logarithm for 20 or 
200 or 2,000, etc., namely, 0.3010; but the characteristic of the 
logarithm of 2 is 0, the characteristic of the logarithm of 20 is 1, 
the characteristic of the logarithm of 200 is 2, and so on. Thus, 
the entire logarithm of 2 is 0.3010; the entire logarithm of 20 is 
1.3010; the entire logarithm of 200 is 2.3010; etc. Hence, when 
the base of the logarithm is 10, logarithmic markings of −2, −1,
0, 1, 2, 3, etc., represent antilogarithmic markings of 0.01, 0.1, 
1, 10, 100, 1,000, etc. 
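The separation of a common logarithm into characteristic and mantissa may be illustrated with a short sketch. The function name is hypothetical; the convention assumed here is the usual one in which the mantissa is never negative, so that a number below 1 receives a negative characteristic.

```python
import math


def characteristic_and_mantissa(x):
    """Split log10(x) into its integral part and non-negative fractional part."""
    log = math.log10(x)
    characteristic = math.floor(log)   # records the position of the decimal point
    mantissa = log - characteristic    # depends only on the succession of digits
    return characteristic, mantissa


for number in (0.02, 2, 20, 200, 2000):
    c, m = characteristic_and_mantissa(number)
    print(number, c, round(m, 4))
```

Every number in the loop shares the digits of 2, so the mantissa is the same throughout while the characteristic moves with the decimal point.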

Semilogarithmic paper to the base 10 is usually ruled to
represent either one logarithmic unit and the fractional parts
thereof, corresponding to equal tenths on the antilogarithmic
scale (called “one-cycle paper”), or two logarithmic units and
the fractional parts of each, corresponding to equal tenths on
the antilogarithmic scale (called “two-cycle paper”), or three
logarithmic units and the fractional parts of each, corresponding
to equal tenths on the antilogarithmic scale (called “three-cycle
paper”). All three of these types of logarithmic rulings are
shown in the right part of Fig. 42. Since the logarithmic scale
is in arithmetic progression, these rulings would be the same for
any logarithm differing by one, two, or three units; they would
apply to logarithms running from −2 to 0, as well as from 0 to 2.
Thus the corresponding antilogarithmic scale can be selected
by the statistician in accordance with his needs. If his data ran
from 2 to 800, for example, he would select three-cycle semilogarithmic
paper and make his scale as indicated on the left
of Fig. 42. If his data ran from 200 to 80,000, he would also
select three-cycle semilogarithmic paper and make his scale from
100 (at the bottom) to 100,000 (at the top). If his data ran from
0.2 to 8, he would choose two-cycle semilogarithmic paper and
make his scale from 0.1 (at the bottom) to 10 (at the top).

Figure 42 is an illustration of a three-cycle ratio scale for the
plotting of a time series by months for 6 years. The scale as
drawn reads from 1 to 1,000, but it could be made to read from
10 to 10,000, or from 100 to 100,000, etc. At the right of the
figure are shown the three most generally used types of ratio
scales, the three-cycle ratio scale, the two-cycle ratio scale, and
the one-cycle ratio scale. If the extreme fluctuations of a time
series are 60 and 3,000, it would be necessary to use three-cycle
paper; on the other hand, if the extreme fluctuations are 60 to
500, it would be necessary to use only two-cycle paper.

Fig. 42.—Three-cycle semilogarithmic paper.
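The choice of paper described above amounts to finding the powers of 10 that enclose the data. A minimal sketch follows; the function name and the rule of rounding outward to whole powers of 10 are assumptions drawn from the examples in the text, not a stated procedure.

```python
import math


def ratio_scale_for(lowest, highest):
    """Return (bottom marking, top marking, number of cycles) for base-10 ratio paper."""
    bottom_exp = math.floor(math.log10(lowest))   # power of 10 at or below the lowest value
    top_exp = math.ceil(math.log10(highest))      # power of 10 at or above the highest value
    return 10.0 ** bottom_exp, 10.0 ** top_exp, top_exp - bottom_exp


for low, high in [(2, 800), (200, 80000), (0.2, 8), (60, 3000), (60, 500)]:
    print(low, high, ratio_scale_for(low, high))
```

The five ranges are those mentioned in the text; the cycle counts returned agree with the paper recommended there in each case.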

Figures 43 and 44 are intended to illustrate the advantages and
disadvantages of the ratio chart. Figure 43 shows the comparative
growth of some famous cities of the United States on an
arithmetic scale, and Fig. 44 shows the same data plotted on a
ratio scale. These data are also shown in Table 7. It will be
noticed that on an arithmetic scale it is not possible to bring the
New York City population growth curve into the picture. On
the ratio paper this is possible. Of course, on the arithmetically
ruled paper New York City population could be plotted on a
different scale; but then the arithmetic comparison between
New York City and the other cities would be lost, since the
height of the curve from the zero line is what counts in the comparison
on arithmetic paper.



Fig. 43.—Growth of certain cities in the United States (arithmetic scale).

The advantage of the ratio chart is threefold. (1) It makes
possible a quick answer to the question as to whether a magnitude
is changing its rate of growth. (2) It clearly pictures the relative
significance of fluctuations; for example, arithmetic differences
of small magnitudes appear as important as the same
relative differences of large magnitudes. On an arithmetic chart
the latter would appear much larger. If an arithmetic chart of
almost any item of production in the United States, say from
1800 to 1940 by years, is constructed, the fluctuations in the
curve for the earlier period will be minute, while the fluctuations
in the curve for the latter part will loom very large. In such
cases the conclusion has therefore sometimes been reached that
instability is greater now than formerly. Plotting the same data
on ratio paper would in most cases show that the earlier fluctuations
were relatively as great as or greater than the modern
ones. (3) It facilitates comparisons between time series in order
to detect correlation between them.



Fig. 44.—Growth of certain cities in the United States (logarithmic, or ratio, scale).

The disadvantage of the ratio chart is that it is not possible
to make magnitude comparisons. For illustration, if the
attempt were made to compare the actual size of Trenton, N.J.,
and New York City in 1930, an entirely incorrect impression
would be created; Trenton would appear from the ratio chart
to be about half as large as New York City in that year if vertical
distance were assumed to be magnitude. When the ratio chart
is used, such magnitude comparisons must be made by the use
of the raw figures themselves, which should always be given in a
table of figures along with the chart.

Table 7.—Population of Specified Cities in the United States from
Earliest Census to 1940
(In thousands)

Census     Trenton,     Portsmouth,     Omaha,      New York
date       N. J.        N. H.           Neb.        City*

1790                        4.7                         49.4
1800                        5.3                         79.2
1810                        6.9                        119.7
1820          3.9           7.3                        152.1
1830          3.9           8.0                        242.3
1840          4.0           7.9                        391.1
1850          6.5           9.7                        696.1
1860         17.2           9.3            1.9       1,174.8
1870         22.9           9.2           16.1       1,478.1
1880                        9.7           30.5       1,911.7
1890         57.5           9.8          140.5       2,507.4
1900         73.3          10.6          102.6       3,437.2
1910         96.8          11.3          124.1       4,766.9
1920        119.3          13.6          191.6
1930        123.4          14.5                      6,930.4
1940        124.7          14.8          223.8       7,455.0

Source: Sixteenth Census of the United States, 1940, Vol. I, Population, pp. 32 and 660.
* Refers to New York City and its boroughs as constituted in 1940.
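The misleading magnitude comparison can be reproduced from the 1930 figures of Table 7 for Trenton and New York City. This is a sketch under one assumption not stated in the text: that the ratio scale of the chart begins at 1 (thousand), so that the height of a point is proportional to its common logarithm. The function name is hypothetical.

```python
import math

trenton_1930 = 123.4      # thousands, from Table 7
new_york_1930 = 6930.4    # thousands, from Table 7
scale_bottom = 1.0        # assumed lowest marking of the ratio scale, in thousands


def height_on_ratio_chart(value, bottom):
    """Vertical distance above the bottom marking, in logarithmic units."""
    return math.log10(value / bottom)


apparent_ratio = (height_on_ratio_chart(trenton_1930, scale_bottom)
                  / height_on_ratio_chart(new_york_1930, scale_bottom))
actual_ratio = trenton_1930 / new_york_1930

print(round(apparent_ratio, 2), round(actual_ratio, 3))
```

Under this assumption the heights stand in a ratio of roughly one-half, although the raw figures show Trenton to be less than one-fiftieth the size of New York City.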


FREQUENCY SERIES 

Definition of a Frequency Series. A convenient arrangement
of any set of data is a classification according to magnitude,
that is, from smallest to largest. In the case of a time series,
time seems to be the most logical and workable basis of classification,
because it seems reasonable to view things as they occur in
time. There is a rationality about such a procedure. But
another aspect of data, unrelated to time, may be important.
For example, how many different prices of sugar during a given
week differed from the average price for that week, and in what
respect did they differ, or from how wide a range of fluctuations
in price during the week were the respective average weekly
prices calculated? This particular aspect would have no
reference to time, except as a matter of definition of the unit
involved (one would not take prices of the third week in March
to study the average price in the first week of March). When
the arrangement of data according to time of occurrence is not
significant, it seems rational to classify the data in a series from
smallest to largest. When this is done, the resulting series of
data is called an “array.”

Following is an example of an array:

An Array of 10 Children in Third Grade, by Age*

            Age,
            Years

X₁           7¼
X₂           7½
X₃           7¾
X₄           8
X₅           8¼
X₆           8½
X₇           8¾
X₈           9
X₉           9¼
X₁₀          9½

Variable X arranged according to magnitude, where X = age
of children in third grade, X₁ = age of youngest child, etc.,
until X₁₀ is age of oldest child.

The situation may be one where there are a number of children 
of each age, for example: 


An Array of 18 Children in Third Grade, by Age

            Age,
            Years

X₁           6¼
X₂           6½
X₃           6½
X₄           6¾
X₅           6¾
X₆           6¾
X₇           7
X₈           7
X₉           7
X₁₀          7
X₁₁          7¼
X₁₂          7¼
X₁₃          7¼
X₁₄          7½
X₁₅          7½
X₁₆          7½
X₁₇          7¾
X₁₈          8


From the array, it is noticed that there is 1 child 6¼ years old,
and there are 2 children 6½ years old, 3 children 6¾ years old, etc.

Inasmuch as there are really only eight variations of the
variable X, some of which occur more than once, the above is
more conveniently summarized as follows:

Number of Children of Specified Age, among 18 Children in Third Grade

        Age, Years, of Children      Number of Children
        in Third Grade               of Specified Age
                   X                          F

X₁                6¼                          1        F₁
X₂                6½                          2        F₂
X₃                6¾                          3        F₃
X₄                7                           4        F₄
X₅                7¼                          3        F₅
X₆                7½                          3        F₆
X₇                7¾                          1        F₇
X₈                8                           1        F₈
                                             18        ΣF

This is called a “frequency series” or a “frequency distribution”;
the variable is listed in a column in the form of an array,
and in a second column the frequencies of each variation are
set down. It is merely a condensed form of the array and is
particularly convenient, as may be readily imagined, when a
large number of cases is studied. It will be noticed that a new
symbol is introduced, but it is a very simple one and one that
readily suggests itself. F₁ refers to the number of times X₁
occurs, F₂ the number of times X₂ occurs, etc. F stands in
general for the frequency of occurrence of a variation; 18 is the
total number of cases and is therefore the sum of the F's, and
this is written ΣF. (F₁ + F₂ + F₃ + · · · + Fₙ = ΣF.) However,
a more general way to symbolize the total number of
cases is to use a large N. Either ΣF or N could be used, but
it is conventional in statistics to use N to represent ΣF. This Σ
is the capital Greek letter sigma, and it is always used in statistics
to designate “sum” or “total of.”

* The particular magnitude here taken is “age.” Any other common
characteristic could be taken as the magnitude for comparison of the
children, for example, height or weight. The basis must be a common
characteristic or attribute that is a variable magnitude capable of
quantitative measurement.

Nature of a Frequency Distribution and Illustration. The idea
of the array and of the frequency distribution in its barest
simplicity has been illustrated. From the example, it is seen
that the frequency distribution is merely the commonplace and
rational arrangement of a set of data in order of magnitude.
As indicated elsewhere, this form of arrangement discloses a
natural order that appears to persist in all things,¹ namely,
that in a large number of observations of a common characteristic
of a thing the following tendencies exist:

1. A large number of frequencies cluster about a central
magnitude or average, which occurs most frequently. 

2. Small variations above and below this central magnitude 
are numerous. 

3. Large variations are much less frequent. 

4. Extreme variations are rare. 
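As a rough check of these tendencies on a very small scale, the array of 18 third-grade ages given earlier can be condensed into a frequency series by counting how often each age occurs. This is a sketch only; the variable names are hypothetical, and ages are held as exact fractions so that quarter-years are not blurred by rounding.

```python
from collections import Counter
from fractions import Fraction

# The 18 ages of the array, in years, written as exact fractions.
ages = [Fraction(s) for s in (
    "25/4 13/2 13/2 27/4 27/4 27/4 7 7 7 7 "
    "29/4 29/4 29/4 15/2 15/2 15/2 31/4 8").split()]

frequency = Counter(ages)            # F for each distinct value of X
table = sorted(frequency.items())    # the frequency series, smallest X first
n = sum(frequency.values())          # N, the sum of the F's

modal_age, modal_count = max(table, key=lambda row: row[1])

for age, f in table:
    print(float(age), f)
print("N =", n)
```

The frequencies rise to a peak at the central age of 7 and fall away toward both extremes, each of which occurs only once.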

Following is an example of a frequency distribution showing 
the number of cities of 100,000 or more population that have 
specified death rates from puerperal causes: 

Table 8.—Maternal Mortality in Cities of 100,000 or More
Population in the United States, 1938

Death Rates
(Number per 1,000         Number of
Live Births)              Cities
       X                     F

       1-                    2
       2-                   16
       3-                   18
       4-                   20
       5-                   15
       6-                   10
       7-                    4
       8-                    6
       9-                    0
      10-                    2

                            93

Source: Bureau of the Census, “Vital Statistics,” Special Reports, Vol. 9, No. 7 (Feb. 10,
1940), pp. 125-126.

The average maternity death rate for these 93 cities is 4.8 per
1,000 live births. It will be noted that, instead of writing X₁,
X₂, X₃, . . . , for each variant of the variable X, the symbol X
is written at the head of the column, indicating that the column
consists of X₁, X₂, X₃ . . . Xₙ. The symbol F is handled in a
similar manner. Furthermore, in this illustration, class intervals
of 1 are used, which is signified by the dash after each of the
numbers in the X column. This is because fractional rates are
given in the source and not merely rounded numbers. For
example, the death rate from puerperal causes in 1938 in the
city of Akron, Ohio, was 4.0; in the city of Albany, N.Y., it was
3.1; in the city of Atlanta, Ga., it was 4.4. Since the death
rates are given to one decimal place, if class intervals were not
used for the frequency table, it would require some hundred or
more rows of figures to place the death rates in an array. The
symbol for the class interval is i. In this case, i = 10 decimal
units, or 1. The average 4.8 was calculated by assuming that
cases in any class interval all had the value of the mid-point
of the interval.²

¹ See Chaps. VI and VII.
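The calculation of the average from class mid-points can be followed step by step with the frequencies of Table 8. The sketch below assumes, as the text states, that every case in a class takes the value of the mid-point of that class; the variable names are hypothetical.

```python
lower_limits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # the classes 1-, 2-, ..., 10-
frequencies = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]  # number of cities in each class
i = 1                                              # class interval

midpoints = [lower + i / 2 for lower in lower_limits]   # 1.5, 2.5, ..., 10.5
n = sum(frequencies)                                    # N, the total number of cities
mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

print(n, round(mean, 1))
```

The weighted sum of the mid-points is 445.5, which divided by the 93 cities gives the average of about 4.8 quoted in the text.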

Discrete and Continuous Frequency Series. A discrete frequency
series is one in which the units of measurements are
more or less fixed by the character of the data. The phenomena 
actually occur in such a manner that their variations in size 
proceed by distinct jumps or steps. The unit of measurement is 
fixed by this fact. An example of such a series is a frequency 
distribution of interest rates, in which the quoted variations 
in rates are likely to fluctuate by i or ^ per cent jumps and 
there are few if any intermediate variations. The variation 
in the range of the actual cases is consequently by distinct steps
of X or ^ per cent. The variation throughout the range is not 
by infinitesimal amounts. The very character of the data 
determines the unit of measurement and its degree of refinement. 
Where variation proceeds in this manner, by discrete steps of 
considerable magnitude as compared with the whole range of 
variation, it is probably best not to use a class interval. If the 
number of different values of X that occur are too numerous for 
convenience, however, then the data may be grouped into class 
intervals. Great care should be employed in this case to see 
that the class intervals are chosen so that the possible values of 
X are placed in a balanced position throughout the intervals. 
For example, if values of X occur at 0, 2, 4, 6, 8, etc., then, if 
grouping is desired, a class interval of size 4 might be chosen 
running from 1 up to but not including 5, from 5 up to but not 
including 9, etc. These would balance the actual X values
around the center of each interval. On the other hand, intervals
of 4, running from 0 up to but not including 4, from 4 up to but
not including 8, etc., would result in the actual X values occurring
at the lower limit and middle of each interval, causing an
upward bias if the cases are assumed to be concentrated at the
mid-points of the intervals, as is usual.³ If the discrete data vary
by steps that are small in relation to the range of variation in the
data (e.g., in steps of 1 cent over a range of $100), then the data
might reasonably be treated as if they were continuous.

² For a more complete discussion of the class interval and calculation of
averages, see Chap. VII.
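The upward bias from badly placed intervals can be shown with the values 0, 2, 4, etc., mentioned above. This is a sketch under stated assumptions: one case is taken at each of the values 0 through 14, and in the balanced scheme the value 0 is placed in an interval running from −3 up to but not including 1, which extends the stated intervals downward.

```python
def grouped_mean(values, first_lower_limit, width):
    """Mean computed as if every case lay at the mid-point of its class interval."""
    total = 0.0
    for v in values:
        k = (v - first_lower_limit) // width               # which interval contains v
        total += first_lower_limit + k * width + width / 2  # mid-point of that interval
    return total / len(values)


values = list(range(0, 16, 2))           # 0, 2, 4, ..., 14, one case at each
true_mean = sum(values) / len(values)

balanced = grouped_mean(values, first_lower_limit=1, width=4)    # intervals 1-5, 5-9, ...
unbalanced = grouped_mean(values, first_lower_limit=0, width=4)  # intervals 0-4, 4-8, ...

print(true_mean, balanced, unbalanced)
```

With the intervals beginning at 1 the grouped mean equals the true mean; with the intervals beginning at 0 every case is moved up to a mid-point at or above its own value, and the grouped mean comes out too high.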

A continuous series is one representing a phenomenon that 
varies by infinitesimal amounts. It may have the appearance 
on the statistical table of the same discreteness as the discrete 
series; but this is because the arbitrarily discrete character of 
the unit of measurement eclipses the actual continuous character 
of the data. In a continuous series the range of the interval
is obtained by a process of testing and finding the one that
appears best to smooth the data, following the general rules for
determining the class interval discussed later.⁴ Frequency
series of all growth phenomena are of the continuous type.
For example, the frequency distributions of weights or heights
of people of some specified age are continuous in character.
In passing from one height to another, the individual must
necessarily pass through every minute difference between; and
accordingly in measuring the heights of individuals at the same
age (or of mature people) the variants will be by minute or
infinitesimal differences. The units of measurement, however,
will make them appear discrete in character.

³ Cf. Chap. VII.

⁴ See Chap. VII.

Charts of Frequency Distributions. A frequency table is the
presentation of a series of variable magnitudes, usually arranged
from smallest to largest, in such a manner as to record the frequencies
of the different magnitudes. For purposes of graphing
it is conventional to use the x-axis for the variable magnitude
and the y-axis for the frequencies. For illustration, in Fig. 45,
the x-axis shows the variations of magnitude (death rates from
puerperal causes in 1938) and the y-axis the frequencies (the
number of cities of 100,000 or more population) of those death
rates, so that the points appearing from the left to the right
signify the following:

Death rates in 1938 from puerperal causes:

2 cities have death rates between 1 and 2 per 1,000 live births.

16 cities have death rates between 2 and 3 per 1,000 live births.

6 cities have death rates between 8 and 9 per 1,000 live births.

0 cities have death rates between 9 and 10 per 1,000 live births.

2 cities have death rates between 10 and 11 per 1,000 live births.

The points are plotted over the mid-points to indicate that
the frequencies cover the class interval and not merely the
rounded quantities shown on the scale. Accordingly, F₁, or 2,
is plotted directly over 1.5; F₄, or 20, is plotted directly over 4.5;
etc. It is easily seen from the figure that the peak of the frequencies
is in the interval containing the average. It can also
be seen that numerous small variations from the average occur,
but large variations from the average are few in number; that is,
the frequency polygon slopes rapidly downward on each side
of the average where the frequency is highest. Variations of
1 below average death rate (death rate of about 3.8) lie in the
class interval having 18 cases; variations of 1 above average
death rate (death rate of about 5.8) lie in the class interval having
15 cases. Variations of 3 below and above average are much
less frequent; only 2 cases are in the class interval containing
death rate 1.8, and only 4 cases are in the class interval containing
death rate 7.8.
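The readings taken from the frequency polygon can be checked against the figures of Table 8. The following sketch builds the plotted points (mid-point, frequency) and looks up the class containing a given death rate; the helper name is hypothetical and the average is taken as the rounded 4.8 of the text.

```python
import math

lower_limits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]     # classes 1-, 2-, ..., 10-
frequencies = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]   # cities in each class
mean_rate = 4.8                                     # average quoted in the text

# Points of the polygon: each frequency is plotted over the class mid-point.
points = [(lower + 0.5, f) for lower, f in zip(lower_limits, frequencies)]


def cases_in_class_of(rate):
    """Frequency of the class interval (width 1) that contains the given rate."""
    return frequencies[lower_limits.index(math.floor(rate))]


near = (cases_in_class_of(mean_rate - 1), cases_in_class_of(mean_rate + 1))
far = (cases_in_class_of(mean_rate - 3), cases_in_class_of(mean_rate + 3))

print(points[0], points[3], near, far)
```

Classes one unit from the average still hold 18 and 15 cities, while those three units away hold only 2 and 4, in agreement with the description of the polygon.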

Instead of a polygon to trace the direction of frequencies,
the practice of using bars to depict frequency distributions is
often followed. Figures 46 to 48 are illustrations of such graphs
of frequency distributions. It is possible also to fit a curve to
the points either by freehand or by mathematical means and
thus describe graphically the frequency distribution by a curve,
which is called a “frequency curve.”⁵

Fig. 46.—Distribution of 617 wholesale price items by percentage of price
change, 1926-1929. (National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)

In Figs. 46 to 48 it is interesting to compare the concentration
of percentage changes in the three different periods, namely,
1926-1929, when prices and economic activity were comparatively
stable; 1929-1932, when prices and economic activity
were on the decline; and 1932-1937, when prices and economic
activity were increasing. Figure 46, depicting the distribution
of percentage price changes, 1926-1929, is quite symmetrical,
and the slope on each side of the maximum frequency is rapid;
the position of the mean (wholesale price index for all commodities,
or −4.7) is close to midway between the two extreme ranges
of the variable. In Figure 47, however, there is no such symmetry.
On the contrary, there is a piling up of cases in the
negative direction so that the slope to the left of the maximum
frequency is gradual while the slope to the right is parabolic; the
distribution appears to have a tail in the negative direction.
Figure 48, on the other hand, shows the opposite tendencies,
with the appearance of a tail extending in the positive direction.

⁵ See Chap. VI.

Fig. 47.—Distribution of 617 wholesale price items by percentage of price
change, 1929-1932. (National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)

Fig. 48.—Distribution of 617 wholesale price items by percentage of price
change, 1932-1937. (National Resources Committee, The Structure of the American
Economy, Part I, pp. 128 and 131.)
Figures 49 and 50 illustrate the use of frequency curves in
chemical studies.
Figures 51 and 52 are illustrations of the use of frequency
histograms in biochemical studies.

While the frequency distribution in Fig. 45 is in the form of a
polygon, those of Figs. 46 to 48 and 51 and 52 are shown by
outline bars. When a frequency distribution is drawn with bars,
the graph is called a “histogram.”



Figs. 49 and 50.—Analysis of the semicarbazone, melting point 178-179°,
proved it to be derived from an aldehyde C₁₅H₂₂O. The position of its absorption
maximum at 3250 A. and that of the free aldehyde (3150 A.) regenerated on
hydrolysis with phthalic anhydride are in excellent agreement with the positions
found for citrylidenecrotonaldehydes and their semicarbazones. [Barraclough,
E., J. W. Batty, I. M. Heilbron, and W. E. Jones, “Studies in the Polyene Series,
Part I,” Journal of the Chemical Society (London), October, 1939, p. 1551.]


Frequency Distribution Plotted on a Ratio Scale. At an earlier 
point in this chapter (page 131) the effect of plotting a time series 
on a ratio scale (scmilogarithmic paper) was discussed. For 
some purposes the use of similar paper for the plotting of a 
frequency series is desirable. Figure 53 shows the effect of 
plotting on a i*atio scale the frequency distribution showing the 
number of cities having specified death rates from puei’peral 
causes. The frequency distribution when plotted on the 
arithmetic scale as shown in Fig. 45 appears to be unsym- 
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metrical, with a steep slope on the left side and a gradual slope 
on the right side. The use of a ratio scale for the variable 



Figs. 51 and 52.—Showing distribution of daily Ca excretion for groups of rats.
Figure 51 shows results of 792 determinations of urinary Ca (mg./100 g./24 hr.)
under standard conditions. Figure 52 shows differences (283 values) between
test and arbitrarily selected control groups. In both cases the results correspond
with a normal distribution. [Truszkowski, R., J. Blauth-Opieńska, and J.
Iwanowska, “Parathyroid Hormone,” The Biochemical Journal (London), Vol. 33
(1939), p. 1007.]



Fig. 53.—Death rates in 1938 from puerperal causes. (Cf. Fig. 45.)


magnitude (continuing the use of an arithmetic scale for the
frequencies), as illustrated in Fig. 53, has reduced this contrast to
such an extent that the slopes on either side are almost the same
and the frequency polygon appears to be almost symmetrical.

An interesting application of logarithmic frequency-distribution
analysis has recently been made in entomology¹ by
C. B. Williams, who says:

Mr. Yule shows that the frequency distribution of sentence length 
(i.e., number of words between successive full stops) is of the skew type 
and by comparing two different manuscripts ... he is able to produce 
convincing mathematical evidence on the identity or otherwise of their 
authorship. . . . When I converted some of Yule's tables into diagrams
I was struck by their general resemblance to skew distributions with
which I have recently been dealing in some entomological problems,
. . . which distributions, I found, became normal and symmetrical if
the logarithm of the number was taken as a basis for subdivision into 
groups instead of the number itself. 

Taking the logarithm of the number as a basis for subdivision 
into groups instead of the number itself accomplished the same 
end as the plotting of the original groups on a logarithmic or 
ratio scale. 
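
The effect of grouping by logarithms can be sketched numerically. The fragment below is an illustration only: the counts are hypothetical (not Williams's or Yule's data), and the moment coefficient of skewness is used as an assumed measure of asymmetry that the text has not yet introduced.

```python
import math

# Hypothetical counts (for instance, insects caught per trap); illustrative only.
counts = [1, 2, 2, 3, 4, 4, 5, 8, 8, 10, 16, 16, 20, 32, 40, 64, 128]


def skewness(values):
    """Moment coefficient of skewness: third central moment over sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Skewness of the numbers themselves (arithmetic scale).
raw_skew = skewness(counts)

# Skewness after taking logarithms (equivalent to a ratio scale for the variable).
log_skew = skewness([math.log10(v) for v in counts])

print(f"arithmetic scale: {raw_skew:.2f}   logarithmic scale: {log_skew:.2f}")
```

For data of this kind the coefficient on the logarithmic scale should be much nearer zero than on the arithmetic scale, which is the numerical counterpart of the polygon appearing almost symmetrical.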


GROWTH CURVES 

Not all curves shaped like frequency polygons or curves are
in truth graphs of frequency distributions. Some growth curves
assume shapes very similar to frequency curves.² Figure 54
is an illustration of a growth curve, showing the increase in
Chlorella vulgaris cultures over a period of hours. The two
curves contrast the peak of growth for two different-sized inoculums;
in both cases the rate of multiplication per cell varied
inversely with the density of population, not only in the early
stages of growth but throughout the growth period in each
culture.


BIVARIATE SERIES 

Bivariates are cross classifications of two variable characteristics
possessed in common by the objects being studied.
Graphs of bivariates are sometimes confused with frequency
distributions because in some cases their shape resembles the
frequency curve. Charts of bivariates, however, may assume
almost any shape, and the center of the distribution may have no
more importance than any other part of it. Good examples of
bivariate comparisons may be found among the great variety
of vital events when they are related to the different ages by
their frequency of occurrence.

¹ Williams, C. B., “A Note on the Statistical Analysis of Sentence-length
as a Criterion of Literary Style,” Biometrika, Vol. 31 (1940), Parts III, IV,
pp. 356-361.

² For other types of growth curves, see Chap. XX.

Fig. 54.—Growth curve showing the rate of increase in population in Chlorella
vulgaris cultures as a function of time. Horizontal axis: hours. [Pratt, Robertson,
“Influence of the Size of the Inoculum on the Growth of Chlorella vulgaris in Freshly
Prepared Culture Medium,” American Journal of Botany, Vol. 27 (January, 1940), p. 54.]

Table 9 and Fig. 55 present a set of such distributions. These
are clearly bivariate comparisons. The x scale in these charts
is the variation in age, from childhood to old age, representing a
heterogeneous scale with respect to many vital events, such as
susceptibility to certain types of disease, accident, etc. Difference
in age constitutes in itself an attribute introducing lack of
homogeneity where such a reference is made of it. With reference
to many types of diseases, man at very tender ages and at
old ages is a different being from man at middle life or in the
prime of youth. Such bivariates have no reference to central
tendencies; the matter of central tendencies is irrelevant.
What is sought is a picture of the association between the two
variables, and the very character of the data is such that there
can be no expectation of a piling up of frequencies about one
average or central tendency. Figure 55 is presented to show a
number of examples of bivariate charts. It is readily seen that
when the purpose is understood such charts are very useful as
a method of picturing vital statistics; but merely because the
shape of the two last examples resembles the frequency polygon
it does not follow that these are true frequency distributions.

[Fig. 55. Axis label: death rate per 100,000.]


Table 9.—Death Rates per 100,000 Population in the United States,
1929, from Specified Causes, by Age*

Age       Tuberculosis   Scarlet       Cerebral      Broncho-      Puerperal
group     of the lungs,  fever,        hemorrhage,   pneumonia,    septicemia,
          male whites    male whites   male whites   male whites   female whites

0-              7.01         11.29          2.06        182.00
5-              2.27          5.81          0.59          6.44
10-             3.13          1.74          0.47          2.85          0.15
15-            22.37          1.11          0.83          3.42          9.94
20-            56.33          0.94          1.50          4.33         23.01
25-            72.28          0.92          2.19          4.66         24.72
30-            80.34          0.79          4.24          6.70         22.25
35-            86.17          0.60          9.95          9.38         18.48
40-            95.66          0.19         21.47         13.91          8.18
45-           101.08          0.18         45.22         16.47          0.99
50-           100.32          0.28         85.37         22.38
55-           105.27          0.09        170.99         29.77
60-           102.63          0.17        286.15         46.03
65-           114.62          0.23        506.14         77.05
70-           106.77                      814.09        124.62
75-           110.39                    1,323.92        238.82
80-            76.09‡                   2,015.65        445.22
85-                                     2,477.50        845.00
90-                                     2,365.00      1,035.00
95-, 100-                               4,320.00      1,920.00

[‡ Bracketed in the original: the tuberculosis rate 76.09 appears to apply to ages 80 and over as a single group; the last row combines the age groups 95- and 100-.]

* The rates in this table were calculated from data on total deaths by ages in the total
registration area of the United States in 1929 according to the Bureau of the Census (1932),
Thirteenth Annual Report on Mortality Statistics, 1929, pp. 196-197, 198-199, 202-
203, 206-207, 210-211; and population of the United States by age groups as reported
in the Abstract of the Census (1930), p. 183. In 1929, the death registration area of
continental United States included 95.7 per cent of the total population.
† Rates in italics based on less than 10 deaths.

The odd shapes that may be assumed by bivariate charts
are shown by these illustrations. They may be U shaped; they
may be J shaped; they may be S shaped; and, of course, they
may be shaped like an ordinary frequency distribution, but when
they are, this is a matter of coincidence, without significance.¹
Figure 56 is an illustration of a bivariate chart of data in the
field of the natural sciences, which is shaped like a frequency
curve and which even uses the word “frequency” in the title of
one of its units, though it is not a frequency curve. It is a chart
of a bivariate comparison: the amplitude in centimeters compared
with frequency in cycles per second.



Fig. 56.—Another bivariate comparison. Horizontal axis: frequency, cycles per sec.
[Clark, A. L., and L. Katz, “Resonance Method for Measuring the Ratio of the
Specific Heats of a Gas, Cp/Cv,” Canadian Journal of Research, Vol. 18 (February, 1940), p. 30.]

Figure 57 shows the relationship between inventories and
shipments of all manufacturing industries in the United States
and is a bivariate chart. The dotted line on the figure represents
the average relationship of inventories to shipments based on
the 2½-year period from 1939 through the second quarter of 1941.
Deviations from this relationship by the quarterly items were
small during the base period, the expansion of inventories being
generally in proportion to the expansion of shipments. In
contrast, inventories increased phenomenally in relation to
shipments during the latter half of 1941 and the first half of
1942. Protective buying replaced immediate production needs
as a motive for much of the inventory accumulation during this
second period, and stocks expanded far out of line with the indicated
requirements of production, assuming that the shipments
give an indication of requirements for production.

Fig. 57.—A third example of a bivariate comparison. Axis label: shipments, total
for quarter (billions of dollars). [Source: Survey of Current Business, Vol. 23 (1943), pp. 3-9.]

¹ Cf. also such a type of frequency distribution as that described by
Thomas V. Pearce, “An Unusual Frequency Distribution—the Term of
Abortion,” Biometrika, Vol. 22 (1930-1931), pp. 250-252.

STATIC VARIATION AND DYNAMIC VARIATION 

In statistical analysis there are two general forms of variation.
The static form of variation is that occurring at a given point
in time or occurring in such a manner that time may be rationally
regarded as irrelevant to the variation. Where the variations
that occur are a function of time, however, the variation is
dynamic and requires different methods of analysis. In the
main, the methods of analysis of static variation center in the
treatment of the frequency distribution, whereas the methods
of analysis of dynamic variation call for a different application
of principles. The same fundamentals, however, are used in
the analysis of dynamic variation or time series as those used
for the analysis of frequency distributions; only the applications
differ.

Rational Frequency Distributions. A rational frequency distribution
is one in which the arrangement of the data is suggested
by the nature of the matter observed. Such a frequency distribution
is rational also because the variability of a common
characteristic is chosen as the basis of the particular classification
and this basis remains comparable among the objects measured.
Frequently, the same idea is expressed by saying that the data
are homogeneous; thus a rational frequency distribution means
one in which the variable is homogeneous.

Homogeneity may be defined as the condition prerequisite to 
comparability of data with respect to the attribute or factor 
being considered. The negative aspect of this condition is that 
attributes not being considered are judged unimportant for the 
purposes of the study in hand. The positive aspect of this 
condition is that attributes or factors judged important for the 
purpose of the study are taken into consideration. 

For example, if the attribute height of human beings is being 
considered, color of eyes may be judged irrelevant and therefore 
is not considered. But, for a homogeneous study, age, sex, and 
perhaps race are attributes that must be considered because they 
are all correlated with the attribute height and cannot therefore 
be judged unimportant in studying height. Unimportant 
attributes (those ignored) have zero correlation with the attribute
studied. Attributes correlated with the attribute studied
must be taken into consideration in order to obtain homogeneous
data. In the example of heights, homogeneity is obtained by 
classification, that is, by taking heights of a particular class in 
which the correlated attributes are constants. Thus, heights of 
mature Caucasian males may be taken as one homogeneous 
group; another homogeneous group would be heights of mature 
Caucasian females; another would be sixteen-year-old Caucasian 
males; etc. 

An important result of homogeneity is that no particular cause
of bias or cumulated variation is present. On the contrary, the
causes of variation consist of many minute mutually uncorrelated
(or independent) causes of variation that occur according to the
law of large numbers, in other words, in a random manner.¹

Irrational Frequency Distributions. By disregarding the element
of time present in a time series, whose natural arrangement
is according to time occurrence, the data may be reclassified and
arranged in an array, or a frequency distribution. Such a
rearrangement would conceal the natural time sequence originally
present in time-series data when in their natural or rational order.
This type of frequency distribution is irrational as a method of
summarization. The multiple forces affecting variability in a
time series are not usually operative at random or in a mutually
independent manner. On the contrary, the causes of variation
may and usually do form a cumulative series of mutually dependent
variations. It is to be noted in passing that Figs. 46 to 48
are not distributions of time series. In the data for each of these
frequency distributions the attribute summarized is for a specified
time and all the variables are for that specified time; thus time
is held constant, and the variation shown in the histogram is
uncorrelated with time. These are rational frequency distributions.
But as soon as the data are viewed in their dynamic
aspect, that is to say, are correlated with time, the many biasing
attributes or factors of time destroy homogeneity in the data.²
For example, with respect to the price of sugar taken as a time
series, the supply at a subsequent period might tend to be larger
as a result of the relatively high price existing at the earlier
period; and as a consequence the previous high price is a cause
of a later lower price. The existence of a price situation, moreover,
at a given time may produce technological changes in the production
and distribution of sugar that in turn will be a dominant
factor in the determination of a subsequent price.
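
What is concealed by such a rearrangement can be shown with a small sketch. The two price series below are hypothetical and chosen only so that they contain the same values in a different time order; `frequency_table` is an illustrative helper, not a procedure from the text.

```python
from collections import Counter

# Two hypothetical price series containing the same values in different order.
rising = [10, 11, 12, 13, 14, 15]   # a steady upward movement through time
zigzag = [13, 10, 15, 11, 14, 12]   # the same prices with no trend


def frequency_table(series):
    """Count how often each value occurs, disregarding time order."""
    return Counter(series)


# The frequency distributions are identical ...
same_distribution = frequency_table(rising) == frequency_table(zigzag)

# ... although the time sequences are not.
same_sequence = rising == zigzag

print(same_distribution, same_sequence)
```

Two series with quite different behavior through time thus yield one and the same frequency distribution, which is the sense in which the time sequence is concealed.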

In spite of the fact that the procedure of reclassifying time 
series and arranging the data in frequency distributions is 
irrational, it has legitimate uses in statistical analysis. There 
is a place for irrational procedure in the progressive development 
of knowledge; but, when used, the user should be conscious of the 
irrationality involved.³

¹ For careful consideration of the law of large numbers, see Chaps. IX-XI;
also see J. G. Smith and A. J. Duncan, Sampling Statistics and Applications,
pp. 101-103, hereafter referred to by the short title Sampling Statistics.

² For further discussion of time-series analysis see Chaps. XIX-XXV.

³ Fischer, Ludwig, The Structure of Thought, p. 360. Cf. for illustrations
of such rearrangement of time series, Dickson H. Leavens, “Frequency
Distributions Corresponding to Time Series,” Journal of the American
Statistical Association, Vol. 26 (1931), pp. 407-415.

VARIABLE QUALITIES, OR ATTRIBUTES

When statisticians have to deal with variable qualities, such as
different colors, different races, different climatic conditions,
different geographical locations, or different intellectual or moral
capacities, their problems are principally questions of the consistent
use of class or group distinctions. Usually there is no
need for elaborate quantitative treatment. Yet, so far as
possible, statisticians strive to convert quality, or attribute,
differences into quantitative terms, and when that is accomplished,
their analysis is similar to the analysis of frequency
series. It has been found, for example, that certain tests can
be made to provide quantitative measures of differences in
intelligence, native or acquired; and a large scope for statistical
analysis lies in the field of education and psychology through
the use of these tests.




PART II 


Analysis of Frequency Distributions 

CHAPTER VI 

SUMMARIZATION AND COMPARISON 

For summarization and comparison of static variation the 
fundamental tool of analysis is the frequency distribution, its
graphic presentation, and the analysis of its characteristics. The 
frequency distribution portrayed in a table or in a graph gives 
a picture of the whole of the variation relative to some 
particular matter; but how can comparisons be made? The 
frequency table, especially if large numbers of magnitudes are 
involved, even though it is admittedly better than a haphazard 
arrangement of the data, requires study before the mind can 
grasp its full significance. If two frequency distributions of 
heights (e.g., of mature males in New Jersey in 1800 and of
mature males in New Jersey in 1900) are to be compared, the 
frequency table could be used, but the total number of cases 
measured might be different in each year taken, which would 
make it more difficult to discern the similarity or lack of similarity 
of the two distributions. To make comparisons, a chart could 
be drawn; but a chart may be large or small depending on the 
scale used, and differences would then appear from purely 
arbitrary, mechanical causes having no real significance. Moreover,
if the heights of these same males and also their weights
are to be compared, a comparison of nonhomogeneous units 
(inches of height and pounds of weight) is required. Clearly 
some method or methods of summarization and comparison of 
frequency distributions must be devised. 

Use of Frequency Distributions. The common practice of
attaining a summary figure by averaging is familiar to all, but
it should be clear that an average, taken by itself, is indeed a
very summary expression for a variable. It is one value, used
to represent a whole series of variations; and a study of variations
about the average may be as important as or more important
than the study of the average alone. In statistics and in most of
the fields of study that use statistics and statistical methods, the
average is generally a convenient point of departure for a more
adequate analysis of the variable.¹

Types of Comparison. There are six possible ways in which it
may be desirable to obtain summary figures and to make comparisons.
This may be explained by the use of diagrams, as
follows:

Fig. 58.—A central tendency as a . . . (horizontal axis: quantity variations).

In Fig. 58, the central tendency, or average, is located at A,
which is plumb with the peak of the frequency curve. In this
figure the central tendency is typical, in the sense that it is a
magnitude that occurs more frequently than other magnitudes.
It may be looked upon quite rationally as a norm, or
typical value. In such a case, the average value has a significance
for itself, as a summary value, but its principal use is still a
comparative one. For example, suppose that in Fig. 58 the
quantity variations (the x scale) are heights of children of a
specified age while the curve represents the number of children
having the indicated heights. The question is asked whether or
not a certain child is normal in height. If the child has less
height than height A, how much less must he be so that lack of
development in this respect indicates need for medical advice?
At once it is suggested that it is important to determine how much
on the average children vary from this normal height. Accordingly,
the principal use of the average as a summary figure, when used
as a norm, is to compare individual variations with the average
and to compare individual variations with the average amount of
variation to be expected.
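
The two comparisons just described can be sketched in a few lines. The heights below are hypothetical, and the average (mean absolute) deviation is used here as an assumed measure of "the average amount of variation"; the text has not yet fixed on a particular measure.

```python
# Hypothetical heights (inches) of children of one specified age.
heights = [48, 49, 50, 50, 51, 51, 51, 52, 52, 53, 54]

# The average, used as the norm (point A in the discussion above).
mean_height = sum(heights) / len(heights)

# Average amount by which the children vary from the norm.
average_deviation = sum(abs(h - mean_height) for h in heights) / len(heights)


def deviation_ratio(height):
    """A child's deviation from the norm, in units of the average deviation."""
    return (height - mean_height) / average_deviation


# A child of 46 inches lies about four average deviations below the norm.
print(mean_height, round(average_deviation, 2), round(deviation_ratio(46), 2))
```

Whether a ratio of this size indicates need for medical advice is a matter of judgment that the arithmetic alone does not settle.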

The second type of comparison is the difference in central tendencies
existing between two distributions a and b, as illustrated
in Fig. 59. This difference is measured by comparing the averages
of the two distributions; for example, by the comparison of
the average height of children in third grade with the average
height of children in sixth grade. Such a comparison is rational
only where the units of the two frequency distributions are
comparable.

¹ Fisher, R. A., Statistical Methods for Research Workers (1941), Section 1.
References to section numbers are valid for any edition of this book.



Fig. 59.—Two different central tendencies.

Fig. 60.—Similar central tendencies but different variability about
the central tendencies.

The third type of comparison is illustrated in Fig. 60, in
which some sort of measurement of the variability of the variations
about the average is required for making the comparison;
for example, an average of the variations from the central
tendency could be used. Such measures are called “measures of
variability.”



Fig. 61.—Similar central tendencies, but different types of skewness of distribution
about central tendencies. (Horizontal axis: quantity variations, with the central
tendency marked A.)

Figure 61 illustrates the fourth type of comparison. Frequency
curves a and b have peaks plumb with the same quantity,
A, but a is skewed to the left and b is skewed to the right. The
central tendency A is a value of greatest frequency in both
curves, but the lower range of curve a is farther from A than is
the lower range of b; and the upper range of a is much nearer
to A than is the upper range of b. School grades sometimes
have a frequency distribution like a, with the most common grade
around 70 or 80 and with very few above 90, yet with some grades
below 20 or even as low as 10. Personal incomes are distributed
like curve b, with the most common income at an amount near
to the lowest and a few incomes of amounts far above the most
common amount. When frequency curves like a or b in Fig.
61 are encountered, it is desirable to have some way to measure
skewness and evaluate its importance in connection with
the interpretation of other statistics about the frequency curve.

Fig. 62.—Different central tendencies and different variabilities.

In the fifth type of comparison, illustrated by Fig. 62, not only
may it be desirable to compare average with average and variability
with variability in aggregate terms, but it may be essential also
to find a way to compare relative variability. The variability in
b relative to its average may not be so much larger than the
relative variability in a as the graph seems to show. The graph
shows that the absolute variability in b is greater; but it may
be that the relative comparison is the more significant one. To
make the relative comparison requires the calculation of further
information.
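
A brief sketch may make the distinction concrete. The two groups below are hypothetical, and the ratio of standard deviation to mean (commonly called the coefficient of variation) is an assumed choice of relative measure; the text at this point names none.

```python
import statistics

# Hypothetical measurements for two distributions, a and b.
group_a = [18, 20, 22]
group_b = [90, 100, 110]


def relative_variability(values):
    """Standard deviation expressed as a fraction of the average."""
    return statistics.pstdev(values) / statistics.mean(values)


# Absolute variability: b varies five times as much as a.
abs_a = statistics.pstdev(group_a)
abs_b = statistics.pstdev(group_b)

# Relative variability: the two groups are alike.
rel_a = relative_variability(group_a)
rel_b = relative_variability(group_b)

print(round(abs_a, 3), round(abs_b, 3), round(rel_a, 4), round(rel_b, 4))
```

Here the graph of b would look far more spread out than that of a, yet in proportion to their averages the two vary to the same degree.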

Fig. 63.—Similar central tendencies, similar variabilities, and
absence of skewness, but different concentrations at center and along
tails.

Curves a and b in Fig. 63, which illustrates the sixth type of
comparison, have the same central tendencies and approximately the
same average deviations about their central tendencies; but b has
a relatively greater concentration of small variations close to the
central tendency and also relatively more extreme variations than
does a. Another way of looking at this difference is to note that
the shoulders of a are broader than the shoulders of b and that
the top of a is flatter than the top of b. The relative flatness of
top or breadth of shoulder of a distribution is called “kurtosis.”
The measurement of this characteristic is important in determining
the relative importance of small variations from average in the
two curves.
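
The sixth type of comparison can be illustrated with two small hypothetical sets of deviations that share the same average and the same average deviation. The ratio of the fourth moment to the squared second moment is used below as an assumed measure of kurtosis; the text defers the choice of measure to a later chapter.

```python
# Hypothetical deviations from a common central tendency of zero.
curve_a = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]   # broad shoulders, flat top
curve_b = [-5, -1, 0, 0, 0, 0, 0, 0, 1, 5]     # concentrated center, long tails


def mean_abs_dev(values):
    """Average deviation about the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


def kurtosis(values):
    """Fourth central moment divided by the square of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2


# Same average deviation, very different concentration at center and tails.
print(mean_abs_dev(curve_a), mean_abs_dev(curve_b))
print(round(kurtosis(curve_a), 2), round(kurtosis(curve_b), 2))
```

The averages and average deviations agree, so neither of the earlier measures would distinguish the two sets; only a measure sensitive to the center and the tails does so.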

It appears to follow from the above discussion of six types 
of comparison that the analysis of frequency distributions requires 
the calculation of the average and, in addition, the calculation 
of measures of dispersion.¹

THEORETICAL SIGNIFICANCE OF FREQUENCY CURVES 

Histograms and Frequency Curves. It has been noted that a 
frequency distribution may be graphed in the form of a histogram, 
that is, a figure in which the frequency of any class interval is 
represented by a rectangle erected on that interval as a base and 
with a height equal to the observed frequency.² If the data are
continuous in character, that is, if they change by very small 
jumps, it may become reasonable to represent the frequency 
distribution by a smooth frequency curve rather than by a broken 
histogram. 

Area Histograms. It is possible to make certain modifications
in the form of the ordinary histogram to represent the frequency
of cases occurring in any class interval, not by the height of the
rectangle, but by the area of the rectangle. If the class interval
is equal to unity, an area histogram is identical with one in which
frequencies are represented by heights, since the altitude multiplied
by the base equals the area. But if the class interval is
greater than unity, the height of an area histogram will be
proportionately reduced; if the class interval is less than unity, the
height will be proportionately increased. This follows because in
an area histogram the frequency of any class interval is given by
the height of the rectangle erected on it, multiplied by the length
of its base (that is, by the size of the class interval). In histograms
of the area type, it follows that the total area of the
histogram always equals the total number of cases, N.
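
The rule for heights in an area histogram may be sketched as follows. The class intervals and frequencies are hypothetical, chosen to give one interval greater than unity, one equal to unity, and one less than unity.

```python
# (lower limit, upper limit, frequency); hypothetical, unequal class intervals.
classes = [(0, 2, 6), (2, 3, 5), (3, 3.5, 4)]

# Height of each rectangle = frequency divided by the size of the class interval.
heights = [freq / (upper - lower) for lower, upper, freq in classes]

# Area of each rectangle = height times base; the areas sum to N.
total_area = sum(
    height * (upper - lower)
    for height, (lower, upper, _) in zip(heights, classes)
)

print(heights)      # heights of the three rectangles
print(total_area)   # equals the total number of cases
```

The interval of width 2 has its height reduced to half its frequency, the interval of width one-half has its height doubled, and the total area equals N = 15.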

Relative Frequencies. The histogram may be further modified
by making it represent relative or proportional frequencies,
rather than absolute frequencies. Following is a table showing
a proportional frequency distribution:³

Maternal Mortality in Cities of 100,000 or More Population in the
United States, 1938

Death rates                        Relative number of cities
(number per 1,000    Number of     Expressed as    Expressed as
live births)         cities        proportions     percentages
X                    F             F/N
(1)                  (2)           (3)             (4)

 1-                   2            0.022             2.2
 2-                  16            0.172            17.2
 3-                  18            0.193            19.3
 4-                  20            0.215            21.5
 5-                  15            0.161            16.1
 6-                  10            0.108            10.8
 7-                   4            0.043             4.3
 8-                   6            0.064             6.4
 9-                   0            0.000             0.0
10-                   2            0.022             2.2

                     93            1.000           100.0

¹ See pp. 168-195.

² For illustrations, see Figs. 64 and 66, pp. 187-188.

³ Cf. p. 141.

In the above table the figures in column (3) represent the
proportionate frequencies, namely, the proportionate number of
cities having the specified maternal mortality rates. Since this
illustration has a class interval of 1, an area histogram could be
obtained by plotting the frequencies of column (2) in the form of
vertical bars, with the heights equal to the respective frequencies.
A proportional area histogram could be obtained by similarly
plotting the frequencies shown in column (3); because in the
resulting histogram, the area of each rectangle would represent
the proportion of the total number of cases falling in a class
interval; it would represent F/N instead of F. The total area
of such a histogram will always equal unity, just as the total of
column (3) equals unity. This will be true no matter what the
form or shape of the histogram, because ΣF = N.
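
Columns (3) and (4) can be recomputed from column (2) as a check. The sketch below uses the frequencies of the table; the variable names are illustrative. Unrounded proportions are summed, so the total is unity apart from floating-point error, whereas proportions rounded to three places need not add exactly to 1.000 without adjustment.

```python
# Column (2): number of cities in each death-rate class, 1- through 10-.
cities = [2, 16, 18, 20, 15, 10, 4, 6, 0, 2]

n = sum(cities)                                 # N, the total number of cities
proportions = [f / n for f in cities]           # column (3): F/N
percentages = [100 * p for p in proportions]    # column (4)

# With a class interval of 1, each proportion is also the area of its bar,
# so the total area of the proportional histogram is the sum of column (3).
total_area = sum(proportions)

for rate, f, p in zip(range(1, 11), cities, proportions):
    print(f"{rate:>2}-  {f:>3}  {p:.3f}  {100 * p:5.1f}")
print(n, round(total_area, 3))
```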

Frequency Curves. Suppose that the data, from which the
histogram has been constructed, are a sample from a very large
set of cases, theoretically an infinite set. For instance, the
data might be the heights of 100 adult males of the white race,
instead of the mortality statistics above illustrated. The 100
heights, then, would be a sample of the heights of all adult men
of that race, presumably millions of men. In such a relatively
small sample, the size of the class interval cannot be reduced without
causing the histogram to show very irregular fluctuations.
If, however, many cases are added to the number in the sample,
say heights of 200 men, the size of the class interval could be
reduced, for example, from 10 units to 5 units, without causing
the occurrence of such irregularities. In fact, if the number in the
sample is made larger and larger and at the same time the size of
the class interval is continuously reduced, the histogram will
tend to become more and more regular and the tops of the rectangles,
which are getting narrower and narrower, will come
closer and closer to forming a smooth continuous curve (a frequency
curve). In such a manner the frequency curve may be
viewed as the limit that an area histogram of relative frequencies
approaches as the number of cases is increased and the size of the
class interval is reduced indefinitely. The frequency curve is the
distribution of a theoretically infinite set of data, with a theoretically
infinitesimal class interval.

Being the limit approached by an area histogram of relative
frequencies, the frequency curve has a total area (between the
curve and the x-axis) that is always equal to unity. Furthermore,
any section of area under the curve¹ will give the relative
frequency of the cases falling within the class interval marking
off that section of area. It is upon this basis that tables of
relative frequencies are constructed for certain well-known
frequency curves.²
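
In symbols, the two properties just stated may be written as follows. The notation is an assumption introduced here for illustration: f(x) denotes the height of the frequency curve at x, and a and b are the limits of a class interval.

```latex
% Total area under the frequency curve equals unity:
\int_{-\infty}^{\infty} f(x)\,dx = 1
% Relative frequency of cases falling in the class interval from a to b:
\frac{F_{a,b}}{N} \;\longrightarrow\; \int_{a}^{b} f(x)\,dx
\quad \text{as } N \text{ is increased indefinitely}
```

The second line is the continuous counterpart of F/N for a single rectangle of the proportional area histogram.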

Uses of Frequency Curves. Frequency curves are hypothetical,
but they are idealizations of frequency distributions that are real.
They serve many useful purposes and in the theory of statistics
they are indispensable. One important use of frequency curves
is the graduation of frequency distributions obtained from actual
observation. Suppose, for example, that a frequency distribution
has been constructed, using a class interval of 10 units.
Suppose further that the number of cases is such that any smaller
class interval would introduce marked irregularities into the
distribution, irregularities that, it is believed, would not be present
if an infinitely large number of cases were observed. In this
case a frequency curve fitted to the distribution (histogram)
may be the best means of estimating the true frequency for any
given class interval. In other words, the frequency curve affords
a graduation for the frequency distribution. The frequency
curve makes it possible to interpolate values not given directly
by the original sample frequency distribution.

¹ See Fig. 91, p. 264 and Fig. 94, p. 277.

² See Appendix, Table VI.

Besides serving to graduate a given set of data, frequency 
curves facilitate in other ways the description and comparison of 
frequency distributions. For instance, the peakedness or flatness
of a particular frequency curve, called the “normal frequency
curve,” is taken as the standard to which the peakedness
or flatness of a given distribution is generally referred. Again, 
theoretical analysis shows that data affected by certain kinds of 
forces will tend to be distributed in the form of particular types 
of frequency curves. Certain types of curves, therefore, become 
the expected norm for all data affected by particular kinds of 
forces. As a consequence, the hypothesis that variations in a 
given set of data have resulted from certain forces may be tested 
by noting how well the distribution of the data conforms to the 
type of frequency curve that these forces may be expected to 
produce. In such instances frequency curves help to explain the 
underlying causes of variation. Such an analysis is of special 
importance when it is assumed, as it is in so many statistical 
procedures, that chance is the fundamental cause of variation. 
It is to be noted that a difference in the general form of two 
frequency distributions may in some cases be looked upon as of 
more fundamental importance than a mere difference in their 
averages, dispersion, and the like, because such a difference in 
form may indicate a contrast in the type of forces causing varia¬ 
tion in the data. To detect a fundamental difference of this 
kind frequency curves are used. 

Still another useful purpose served by frequency curves is in 
sampling analysis. Since a chapter is subsequently devoted to a 
discussion of sampling, it need merely be touched upon and 
simply illustrated in very general terms at this point.1 For

1 See Chap. XII; see also Smith, J. G., and A. J. Duncan, Sampling
Statistics, pp. 107-109, Parts II and III.

illustration, suppose that a large number of balls, each with a
number written on it, are placed in a big bowl and thoroughly 
mixed. Suppose that 10 balls are drawn at random from the 
bowl and their numbers read off and averaged. Suppose that 
this sampling operation is repeated over and over again, the 
balls being replaced and thoroughly mixed after each set of
drawings. Experience shows that unless the distribution of numbers
is freakish, the distribution of sample averages will approximate
the so-called “normal” frequency curve. If, instead of
the average of the respective 10 readings, a certain measure of
the variation around their averages, known as the “variance,”
had been recorded in each instance, then the frequency distribution
of these measurements of variation would have tended to
conform to a frequency curve known as the “χ² curve.”1 The
significant thing is that “sampling distributions” of this kind tend
to conform to specific frequency curves that may be described
by definite mathematical formulas. In general, these formulas
are expressed in terms of the characteristics of the “population”
(in the illustration, the bowl of numbers) from which the samples
are drawn. The consequence is that if a random sample
has been obtained from an unknown population, it is possible
from knowledge of the sampling distributions of various sample
measurements to make certain inferences regarding the nature
of the population from which the sample has been drawn. This 
is probably the most important use that is made of frequency 
curves in statistical analysis. 
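The bowl experiment described above can be imitated on a computer. The following is only a sketch under stated assumptions: the bowl is taken to hold balls numbered 1 to 1000 (a hypothetical population, not one given in the text), and the names `sample_means`, `bowl`, and `within_one_sd` are illustrative. It draws many samples of 10, averages each, and reports how closely the averages cluster in the manner expected of a normal curve.

```python
import random
import statistics


def sample_means(population, sample_size, repetitions, seed=0):
    """Return the mean of each of `repetitions` random samples.

    Each sample is drawn without replacement; the whole population is
    available again for the next sample, as when the balls are returned
    to the bowl and remixed.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repetitions)
    ]


# Hypothetical bowl: balls numbered 1 to 1000, a flat (non-normal) population.
bowl = list(range(1, 1001))
means = sample_means(bowl, sample_size=10, repetitions=5000)

population_mean = statistics.fmean(bowl)
center = statistics.fmean(means)
spread = statistics.pstdev(means)

# Share of sample means lying within one standard deviation of their centre.
# For a normal curve this share is roughly 0.68.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"population mean      : {population_mean:.1f}")
print(f"mean of sample means : {center:.1f}")
print(f"share within one s.d.: {within_one_sd:.3f}")
```

Although the numbers in the bowl are spread evenly rather than normally, the sample averages should pile up around the population mean, which is the behavior the paragraph above attributes to experience.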

MEASUREMENTS OF SUMMARIZATION AND COMPARISON 

Population, Parameters, and Statistics. Population. To say 
that the population of the United States is one hundred and 
thirty million people is a familiar use of the word “population.”
In statistics the word is used in the same familiar sense, but it is
also used in a more general sense, referring to the count of
persons or of animals of any kind or even to the count of inanimate
things. To statisticians the term means all the things,
animate or inanimate as the case may be, of a given kind in 
the known universe or in a specified universe, for example, all the 
people on the earth, or all the people in the United States if the 

1 This is read “chi square”; the letter χ is the Greek small chi.

universe is more specific. An example of an inanimate population
would be all the petroleum in the known universe or, if a
more specific universe is considered, all the petroleum in the 
United States. 

Parameters and Statistics. In the theory of statistics the
measurements of the characteristics of the population are called
“parameters.” The average height of all people living in the
United States is a parameter of the population. No one has 
ever actually measured the heights of all the people living in the 
United States, and it is not likely that anyone ever will do so. 
Nevertheless, this population does exist. In practice, it is 
much easier to estimate the average height of all the people by 
taking the average of a sample of the people. This latter 
average, the average of the sample, is called a “statistic.”
Accordingly, parameters are measures of the characteristics of the 
population, and the corresponding sample measures are statistics 
commonly used to estimate these parameters. A statistic is thus 
a value computed from an observed sample in order to characterize
the population from which it is drawn. Parameters are
the characters of the population.1

In accordance with this terminology, the quantities to be 
obtained as measures of central tendencies are “statistics,” the
arithmetic mean is a “statistic,” the range (difference between
the highest and lowest magnitude) of a frequency distribution is
a “statistic.”

Averages. There are several kinds of averages. The one 
most familiar is the arithmetic mean. The others most gen¬ 
erally presented are the median, mode, geometric mean, and 
harmonic mean. The most commonly used averages are the 
mean, the median, and the mode. In this chapter each will be 
viewed in its simplest aspect, and at the same time the symbolic 
language associated with the analysis of frequency distributions 
will be introduced. 

The Arithmetic Mean. By definition, the arithmetic mean 
is the sum of the cases divided by the number of cases. For example,
taking a simple case of ungrouped data, i.e., where the
frequencies are 1 throughout (each X occurs once), $F_1, F_2, \ldots, F_7$
each = 1:


1 Fisher, op. cit., pp. 7-8, 41.

        X                F
$X_1$    2       $F_1$    1
$X_2$    3       $F_2$    1
$X_3$    4       $F_3$    1
$X_4$    6       $F_4$    1
$X_5$    8       $F_5$    1
$X_6$    9       $F_6$    1
$X_7$   10       $F_7$    1

$\sum X = 42$        $\sum F = 7$


The sum of the variable magnitudes in this case is 42. The 
number of variable magnitudes is 7. Hence, by definition, the 
arithmetic mean is 42/7, or 6.

Symbolically, 

$42 = \sum X$, i.e., $X_1 + X_2 + \cdots + X_7$
$7 = \sum F = N$, i.e., $F_1 + F_2 + \cdots + F_7$


The arithmetic mean is represented by the symbol $\bar{X}$, and
hence

$$\bar{X} = \frac{\sum FX}{N} \qquad (1)$$


In frequency distributions the F is not equal to 1 throughout 
but varies. An illustration of the calculation of the arithmetic 
mean of a frequency distribution is shown below: 


 X      F      FX
 2      3       6
 3      3       9
 4      6      24
 5      9      45
 6      6      36
 7      3      21

$\sum F$ or $N = 30$        $\sum FX = 141$


It should be noted that the sum of the X's cannot be obtained
by adding the first column because the various X's occur 3, 6, or
9 times. Consequently, the sum of the X's is obtained by multiplying
each X by its respective frequency and then adding
the products.

$\sum FX = 141$
$\sum F$ or $N = 30$

and therefore, by definition,

$$\bar{X} = \frac{141}{30} = 4.7$$

If the frequencies of a frequency distribution are expressed in 
relative numbers, i.e., if each frequency is expressed relative to
the total number of cases, $F_1/N, F_2/N, \ldots, F_n/N$, the arithmetic
mean is merely the sum of the third column, as follows:



 X      F/N     (F/N)X
 2      0.1      0.2
 3      0.1      0.3
 4      0.2      0.8
 5      0.3      1.5
 6      0.2      1.2
 7      0.1      0.7

        1.0      4.7


Following the definition, the arithmetic mean is the sum of the 
third column divided by the sum of the second column; but the 
sum of the second column is 1, by definition. Consequently, 
the arithmetic mean is the sum of the third column. 

This modified form of the definition of the arithmetic mean is very 
convenient in certain statistical problems. 
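The two forms of the calculation can be checked side by side. The sketch below uses the frequency distribution of the illustration; the function names `mean_from_frequencies` and `mean_from_relative_frequencies` are chosen here for illustration only.

```python
def mean_from_frequencies(values, freqs):
    """Arithmetic mean as (sum of F*X) / N, following Eq. (1)."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / n


def mean_from_relative_frequencies(values, rel_freqs):
    """Arithmetic mean as the sum of (F/N)*X; the F/N must add to 1."""
    return sum(p * x for x, p in zip(values, rel_freqs))


X = [2, 3, 4, 5, 6, 7]
F = [3, 3, 6, 9, 6, 3]
N = sum(F)

mean_abs = mean_from_frequencies(X, F)
mean_rel = mean_from_relative_frequencies(X, [f / N for f in F])

print(N, mean_abs, round(mean_rel, 10))
```

Both routes give the same mean; the second merely performs the division by N before the multiplication instead of after the addition.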

The sum of the deviations of the cases from the arithmetic 
mean is equal to zero. This may be demonstrated as follows: 
Given the variable $X_1, X_2, \ldots, X_n$, with $\bar{X} = \sum FX/N$,

$F_1(X_1 - \bar{X}) = F_1 x_1$
$F_2(X_2 - \bar{X}) = F_2 x_2$
. . . . . . . . . . . .
$F_n(X_n - \bar{X}) = F_n x_n$

$\sum FX - N\bar{X} = \sum Fx$ by adding

The small x is used regularly to refer to the deviations of the
variable from the arithmetic mean.

When added, $\sum F\bar{X}$ becomes $N\bar{X}$ because $\bar{X}$ is constant and
$\sum F$ is equal to N.
If the value of $\bar{X}$ given in Eq. (1) is substituted in

$\sum FX - N\bar{X} = \sum Fx$

it becomes

$\sum FX - N \frac{\sum FX}{N} = \sum Fx$

By canceling the N, this becomes

$\sum FX - \sum FX = \sum Fx$

and hence

$\sum Fx = 0$ (2)
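The result of Eq. (2) can be verified numerically. The following sketch applies it to the frequency distribution used earlier (X from 2 to 7, N = 30); exact fractions are used, as an implementation choice made here, so that the sum comes out as exactly zero rather than as a rounding remnant.

```python
from fractions import Fraction

X = [2, 3, 4, 5, 6, 7]
F = [3, 3, 6, 9, 6, 3]
N = sum(F)

# Arithmetic mean by Eq. (1), kept as an exact fraction (141/30 = 47/10).
mean = Fraction(sum(f * x for x, f in zip(X, F)), N)

# Small x: deviation of each X from the arithmetic mean.
deviations = [x - mean for x in X]

# Sum of F*x, which Eq. (2) states must vanish.
weighted_sum = sum(f * d for f, d in zip(F, deviations))

print(mean, weighted_sum)
```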

The Median and the Quartiles. In its original simplicity and
by definition the median is not a mathematical concept like the 
arithmetic mean. On the contrary, the median is a position 
average. By definition, the median is that value than which there 
is an equal number of cases larger and smaller. When the cases 
are arranged in an array, the median is either the value of the 
middle one (when there is an odd number) or some value between 
the two middle ones (when there is an even number). Nor¬ 
mally, in the latter instance, the arithmetic mean of the two 
middle cases is taken as the median value. To illustrate from a 
very simple example with an odd number of cases: 

X
1
2
3
6
7
8
9

$X_4$, or 6, is the median by definition, because it is the middle
one in the array. Mi = 6 (Mi is the conventional symbol for
median).

It is to be noted that $\bar{X} = 5.143$.

In this illustration it is seen that 1, the first case, is 5 smaller
than the median, while 9, the last case, is only 3 larger than the
median. This preponderance of smallness of the variable 
results in an arithmetic mean smaller than the median. By 
definition, the arithmetic mean is affected by every variation and 
consequently by extreme variations. It is affected by the size
and the number of cases above and below it. The median,
on the other hand, by definition, is not affected by the size of 
the cases above or below it. 

When the frequencies vary, the median may be found as in 
the following illustration: 

X      F      Cumulated F
1      2           2
2      4           6
3      5          11
4      8          19
5      7          26
6      3          29
7      2          31

      31

Thus, there are 2 cases where X = 1; there are 4 cases where
X = 2; etc. In all, there are 31 cases ($\sum F = 31 = N$), and the
middle one is the sixteenth. Mi = 4. That the median is 4 is
quickly seen by the examination of the cumulated frequencies in
the third column. This is equivalent to taking the median equal
to the (N + 1)/2th case, a procedure often adopted when dealing
with ungrouped data.1

In general terms, the first quartile $Q_1$ is that value below which
one-fourth of the cases fall and above which three-fourths of the
cases fall. Similarly, the third quartile $Q_3$ is that value below
which three-quarters of the cases fall and above which one-fourth
of the cases fall. The median is the second quartile, or $Q_2$. In
the above frequency distribution, N/4 = 7¾ and 3N/4 = 23¼.
$Q_1$ should thus be some value below which 7¾ cases fall and above
which 23¼ cases fall. For this distribution, it happens that the
seventh and eighth cases are identical, and it therefore follows
that the value of $Q_1$ is the common value of the seventh and eighth
cases, or $Q_1 = 3$. If the seventh and eighth cases had not been
the same, then $Q_1$ could be taken as a value between the values
of the seventh and the eighth case to be found by interpolation.

For ungrouped data, it is recommended that a uniform and
systematic method of interpolation be adopted, as follows:2

1 When the data are grouped, it is simplest to find the median by interpolation
within the class interval for the N/2th case. This method is described
and illustrated in the next chapter. See pp. 218-220.

2 For the method of interpolation when the data are grouped, see pp.
218-220.

Take $Q_1$ as the (N/4 + ½) case, $Q_2$, or Mi, as the (N/2 + ½)
case, and $Q_3$ as the (3N/4 + ½) case. Consider, for example, the
cases 5, 9, 11, 16, 25, 31, 38, 43, 45, and 49. The (N/4 + ½) case
would be the (10/4 + ½) or third case; i.e., $Q_1 = 11$. The median
would be the (10/2 + ½) or 5½th case. Since there is no 5½th case,
however, but only a fifth case, 25, and a sixth case, 31, the median
is taken as the value that lies just halfway between the fifth and
sixth cases, i.e., Mi = 25 + (31 - 25)/2 = 28. The third quartile
would be the (30/4 + ½) or eighth case; i.e., $Q_3 = 43$.

As another illustration, suppose 51 is added to the set of
numbers, making a total of eleven instead of ten numbers. Then
the first quartile would be the (11/4 + ½) or the 3¼th case. But
there is no 3¼th case, only a third case, 11, and a fourth case, 16.
Hence, $Q_1$ is taken as the value that lies one-fourth of the distance
between the third and fourth cases. The difference between the
third and fourth cases is 16 - 11 = 5, and $Q_1$ is therefore taken
as 11 + 5/4 = 12¼. Similarly, Mi is the (11/2 + ½) or sixth case,
which is 31.
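The interpolation rule just described lends itself to a short routine. What follows is a sketch, not a standard library procedure: the helper names `case_at` and `quartiles` are invented here, and the rule implemented is the (kN/4 + ½)th-case rule of the text, which differs from the quantile conventions built into many statistical packages.

```python
from fractions import Fraction


def case_at(ordered, position):
    """Value at a 1-based position; fractional positions are interpolated."""
    lower = int(position)          # the case just below the position
    fraction = position - lower    # how far toward the next case
    value = Fraction(ordered[lower - 1])
    if fraction:
        value += fraction * (ordered[lower] - ordered[lower - 1])
    return value


def quartiles(cases):
    """Return (Q1, median, Q3) using the (kN/4 + 1/2)th-case rule."""
    ordered = sorted(cases)
    n = len(ordered)
    return tuple(
        case_at(ordered, Fraction(k * n, 4) + Fraction(1, 2)) for k in (1, 2, 3)
    )


ten = [5, 9, 11, 16, 25, 31, 38, 43, 45, 49]
eleven = ten + [51]

# The frequency distribution of the median illustration, written out case by case.
X = [1, 2, 3, 4, 5, 6, 7]
F = [2, 4, 5, 8, 7, 3, 2]
expanded = [x for x, f in zip(X, F) for _ in range(f)]

for label, data in (("ten cases", ten), ("eleven cases", eleven), ("31 cases", expanded)):
    q1, median, q3 = quartiles(data)
    print(label, float(q1), float(median), float(q3))
```

Writing the frequency distribution out case by case is wasteful for large N, but it lets one routine serve both the ungrouped lists and the tabulated distribution.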

The Mode. As in the case of the median, the mode is not a 
mathematical concept. Moreover, it is not a “position average.”
The mode is an average that is described in terms of relative
frequency of occurrence. It is defined as the magnitude that
occurs more frequently than any other. The mode is the most
probable magnitude and might be considered a “probability
average,” because it is often thought of in terms of probabilities.
It may be illustrated as follows: 

 X      F
 2      1
 3      2
 4      4
 6      7
 7      8
 8      5
 9      4
10      3

       34

By definition the mode (Mo) is 7, because this value occurs
more frequently than any of the others. The probability of the
mode is 8/34 and is greater than the probability of any other value
of X.
It is to be noted that the $\bar{X}$ of this example is 6.706 and the
median is 7. The mode is not affected by the size of the cases
above or below it, nor is it affected by the number of cases
above or below it, within certain rational limits. For example,
in the illustration, two magnitudes (X = 8) could be added
to the distribution and the mode would remain 7, as before;
but if five magnitudes (X = 8) were added to the distribution,
then 8 would become the mode, as its frequency would then
be 10.

It has been established that, when the distribution is only 
moderately skewed, the mode can be estimated from the mean 
and median by the following approximate formula: 

$\text{Mo} = \bar{X} - 3(\bar{X} - \text{Mi})$ (3)

Accordingly, the mode may be estimated if the mean and 
the median have been calculated. In actual problems involving 
grouped data the mean and the median are both more accurately 
determinable than is the mode, and for this reason the above 
formula often gives more satisfactory results than any convenient 
direct procedure. This is called the “mathematical mode.”
It should be emphasized that the formula should not be used, 
however, if the distribution is very skewed. 
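Both ways of arriving at a mode can be sketched in a few lines. The frequency of X = 10 is illegible in this copy and is assumed below to be 3, the value that makes the frequencies total 34 and the mean equal 6.706; the function names are illustrative only.

```python
def mode_from_frequencies(values, freqs):
    """The magnitude that occurs more frequently than any other."""
    return max(zip(values, freqs), key=lambda pair: pair[1])[0]


def estimated_mode(mean, median):
    """Approximate mode for a moderately skewed distribution, Eq. (3)."""
    return mean - 3 * (mean - median)


X = [2, 3, 4, 6, 7, 8, 9, 10]
F = [1, 2, 4, 7, 8, 5, 4, 3]   # last frequency assumed, see the note above

observed_mode = mode_from_frequencies(X, F)

# Adding two cases at X = 8 leaves the mode at 7; adding five moves it to 8.
plus_two = mode_from_frequencies(X, [1, 2, 4, 7, 8, 5 + 2, 4, 3])
plus_five = mode_from_frequencies(X, [1, 2, 4, 7, 8, 5 + 5, 4, 3])

mean = sum(f * x for x, f in zip(X, F)) / sum(F)
approximate_mode = estimated_mode(mean, median=7)

print(observed_mode, plus_two, plus_five, round(approximate_mode, 3))
```

With so few cases the formula and the direct count need not agree closely; the formula is intended for grouped data in which the mean and median are the more reliably determined quantities.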

The Geometric Mean. The geometric mean (G.M.) is a 
mathematical concept and is defined as the nth root of the product
of n variables X. Accordingly, the geometric mean of 5, 8,
and 25 is $(5 \times 8 \times 25)^{1/3} = 10$. The geometric mean may
also be defined as the antilogarithm of the arithmetic mean of
the logarithms of the variable X, i.e.,

$$\log \text{G.M.} = \frac{\sum \log X}{N} \qquad (4)$$

This may be illustrated as follows:

log 5  = 0.69897
log 8  = 0.90309
log 25 = 1.39794

$\sum \log X$ = 3.00000
3.00000 / 3 = 1.00000

The antilogarithm of 1.00000 is 10; hence, G.M. = 10.
Just as the arithmetic mean balances the aggregate deviations
so the geometric mean balances the ratios of the variations;
that is,

$$\frac{X_1}{\text{G.M.}} \times \frac{X_2}{\text{G.M.}} \times \cdots \times \frac{X_n}{\text{G.M.}} = 1 \qquad (5)$$

For, by taking logarithms, this expression becomes

$\log X_1 + \log X_2 + \log X_3 + \cdots + \log X_n - N \log \text{G.M.} = 0$

or $\sum \log X - N \log \text{G.M.} = 0$

But from Eq. (4),

$N \log \text{G.M.} = \sum \log X$

and hence the expression is shown to be true.
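Eq. (4) and Eq. (5) can both be checked on the numbers 5, 8, and 25. The sketch below follows the logarithmic route of the illustration, using common logarithms; the function name `geometric_mean` is chosen here for convenience.

```python
import math


def geometric_mean(values):
    """Antilogarithm of the arithmetic mean of the common logarithms, Eq. (4)."""
    logs = [math.log10(x) for x in values]
    return 10 ** (sum(logs) / len(logs))


cases = [5, 8, 25]
gm = geometric_mean(cases)

# Eq. (5): the product of the ratios X / G.M. should be unity.
ratio_product = math.prod(x / gm for x in cases)

print(round(gm, 5), round(ratio_product, 5))
```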

In some types of problems the geometric mean gives more 
satisfactory results than the arithmetic mean. For example, it 
is necessary to use the geometric mean to average percentage 
increases of population over successive years or decades or to 
average percentage changes in income, production, and the like.1
Thus, in the column marked X of the following table the estimated
national income of each year is expressed as a percentage
of the preceding year:


Year    Estimated national income       Increase in national income
        produced in the United          expressed as percentage of
        States,* billions of dollars    previous year, X

1933            47.3
1934            54.6                            115.43
1935            59.2                            108.42
1936            68.9                            116.39
1937            73.1                            106.10
1938            67.0                             91.66
1939            69.7                            104.03

* Survey of Current Business, Vol. 20 (March, 1940), p. 19; (April, 1940), p. 11.

If the average annual percentage increase is obtained by 
calculating the arithmetic average, the answer obtained is 
642.03/6 = 107.01, which represents an average annual percentage

1 Chaddock, R. E., Principles and Methods of Statistics (1925), pp. 126-
127; Croxton, F. E., and D. J. Cowden, Applied Business Statistics (1939),
pp. 225-226.

increase of 7.01 per cent. Now, if 7.01 is used as a constant
annual percentage increase from 1933, the following figures would 
be obtained: 

        Constant 7.01 Percentage
Year    Increase from 47.3 in 1933

1934    107.01 per cent of 47.3  = 50.62
1935    107.01 per cent of 50.62 = 54.17
1936    107.01 per cent of 54.17 = 57.97
1937    107.01 per cent of 57.97 = 62.03
1938    107.01 per cent of 62.03 = 66.38
1939    107.01 per cent of 66.38 = 71.03

But in 1939 the actual figure, as shown in the preceding table,
was 69.7, and the average percentage yearly increase could
not have been so large as 7.01. To obtain the correct percentage
increase, the geometric and not the arithmetic mean
should be calculated in this instance. Following the formula 
given above for the geometric mean, it is calculated for this 
problem as follows: 


Year    Estimated national income       Logarithm of the
        produced in the United          percentage increase,
        States, billions of dollars     log X

1933            47.3
1934            54.6                         2.06232
1935            59.2                         2.03511
1936            68.9                         2.06591
1937            73.1                         2.02572
1938            67.0                         1.96218
1939            69.7                         2.01716


If the average annual percentage increase is now obtained
by calculating the geometric mean of the rates of increase,
by first taking the arithmetic mean of the logarithms

12.16840 / 6 = 2.02807

and then taking the antilogarithm (antilogarithm of 2.02807),
the answer obtained is 106.68, or an average annual percentage
increase of 6.68 per cent. If 6.68 is assumed to be the average
annual percentage increase since 1933, the following figures would
be obtained:

        Constant 6.68 Percentage
Year    Increase from 47.3 in 1933

1934    106.68 per cent of 47.3  = 50.46
1935    106.68 per cent of 50.46 = 53.83
1936    106.68 per cent of 53.83 = 57.43
1937    106.68 per cent of 57.43 = 61.27
1938    106.68 per cent of 61.27 = 65.36
1939    106.68 per cent of 65.36 = 69.73


In 1939 the actual figure, as shown above, was 69.7, to which 
69.73 is a close approximation; and hence the average annual 
percentage increase apparently was in fact close to 6.68. 
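The comparison of the two averages may be repeated by machine. The sketch below takes the national income figures from the table and computes the year-to-year ratios without intermediate rounding, so its results agree with the text only to about the second decimal place; the helper `project` is a name introduced here for illustration.

```python
import math

# Estimated national income, billions of dollars, 1933 through 1939.
income = [47.3, 54.6, 59.2, 68.9, 73.1, 67.0, 69.7]

# Each year expressed as a ratio to the preceding year (1.1543 means 115.43 per cent).
ratios = [later / earlier for earlier, later in zip(income, income[1:])]

arithmetic_rate = sum(ratios) / len(ratios)
geometric_rate = math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def project(start, rate, years):
    """Figure reached after `years` of growth at a constant annual ratio."""
    return start * rate ** years


print(f"arithmetic mean ratio: {arithmetic_rate:.4f}")
print(f"geometric mean ratio : {geometric_rate:.4f}")
print(f"1939 by arithmetic   : {project(income[0], arithmetic_rate, 6):.2f}")
print(f"1939 by geometric    : {project(income[0], geometric_rate, 6):.2f}")
```

Compounding the geometric mean ratio over the six years returns the 1939 figure, whereas compounding the arithmetic mean ratio overshoots it.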

The Harmonic Mean. The harmonic mean (H.M.) is the 
reciprocal of the average of the reciprocals of observations of the 
variable X, thus: 


$$\text{H.M.} = \frac{N}{\sum \frac{1}{X}} \qquad (6)$$


Accordingly, the harmonic mean of 5, 8, and 25 would be found 
as follows: 

From a table of reciprocals or by calculation, the reciprocals 
of 5, 8, and 25 are determined—0.200, 0.125, and 0.040—and 
hence the harmonic mean, by definition, is 

$$\text{H.M.} = \frac{3}{0.200 + 0.125 + 0.040} = \frac{3}{0.365} = 8.22$$

The geometric mean of these three numbers, as discovered above, 
is 10; the arithmetic mean is 5 + 8 + 25 = 38 divided by 3, 
or 12.67. It is thus seen that the arithmetic mean is the 
largest, the geometric mean next, and the harmonic mean the 
smallest of these three averages. It is always true that1

$\text{H.M.} < \text{G.M.} < \bar{X}$ (7)
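The ordering of the three averages can be confirmed for the numbers 5, 8, and 25. The sketch below defines each mean directly from its formula; the function names are illustrative, and the inequality is checked only for this one set of unequal positive values.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.exp(sum(math.log(x) for x in values) / len(values))


def harmonic_mean(values):
    """N divided by the sum of the reciprocals, Eq. (6)."""
    return len(values) / sum(1 / x for x in values)


cases = [5, 8, 25]
hm = harmonic_mean(cases)
gm = geometric_mean(cases)
am = arithmetic_mean(cases)

print(round(hm, 2), round(gm, 2), round(am, 2))
```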

The usefulness of the harmonic mean arises in connection with 
certain types of problems in which variable quantities of one 
variable are compared with a constant quantity of another. 
For illustration, speeds may be looked upon as variable numbers 
of miles per minute (a constant quantity of time) or as variable 
amounts of time required to cover a given distance. Similarly, 

1 For a proof see G. R. Davies and W. F. Crowder, Methods of Statistical
Analysis in the Social Sciences, p. 313.

prices may be looked upon as variable amounts of money per
unit of goods sold (a constant quantity of goods) or as variable 
amounts of goods that can be purchased with $1. In many 
such problems the choice of the variable for which the quantity 
is always constant is optional, depending upon the type of
information it is desired to emphasize. There is a nice distinction
between the mean and the harmonic mean wherever
such interchangeability is present. This will be illustrated 
by examples. 

The accompanying table shows data on prices of corn. 

Wholesale Price of No. 3 Yellow Corn

Year    Dollars per Bushel
1913          0.61
1919          1.59
1929          0.93
1939          0.50

Source: Survey of Current Business, April, 1940, p. 18.

In the table the amount of money varies, but the quantity 
of corn is constant. The average price per bushel may be calculated
directly from this table, thus: 3.63/4 = 0.9075. In order

to obtain the harmonic mean, the reciprocals of these prices must 
first be calculated. 

Wholesale Price of No. 3 Yellow Corn

Year    Bushels per Dollar
1913          1.64
1919          0.63
1929          1.08
1939          2.00

The average of these reciprocals must be computed, thus:
5.35/4 = 1.3375.

The reciprocal of the latter number must be obtained, namely, 
0.74766. This last number is the harmonic mean of the prices
expressed in dollars per bushel. The harmonic mean, therefore,
of the prices per bushel of No. 3 yellow corn is approximately
75 cents; and it is the price per bushel of the average amount of
No. 3 yellow corn that could have been purchased for $1, or,
in other words, it is the reciprocal of the average amount of No. 3
yellow corn that could have been purchased for $1. If the
reciprocal of the mean price, 0.9075, is taken, a figure of quite 
different significance is obtained, namely, 1.102. This reciprocal 
is the average number of bushels that could have been bought at 
the mean price. 

In deciding whether to use the arithmetic or the harmonic 
mean in any given problem, it should be determined which 
magnitude should be regarded as the constant (for example, the 
amount of corn bought or the amount of money spent), a matter 
that can usually be decided without difficulty in a practical 
problem. If the data are recorded with the appropriate quantity 
constant, the arithmetic mean may be used. If the appropriate 
item is made a variable as the data are tabulated, the harmonic 
mean must be used. 
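The corn-price calculation may be retraced as follows. The sketch works from the four prices in the table; because it carries the reciprocals at full precision instead of rounding them to two decimals, its harmonic mean differs slightly from the 0.74766 of the text. The variable names are illustrative.

```python
# Wholesale price of No. 3 yellow corn, dollars per bushel.
prices = {1913: 0.61, 1919: 1.59, 1929: 0.93, 1939: 0.50}

# Quantity of corn held constant: ordinary arithmetic mean of the prices.
mean_price = sum(prices.values()) / len(prices)

# Money held constant: bushels obtainable for $1 in each year.
bushels_per_dollar = {year: 1 / p for year, p in prices.items()}
mean_bushels = sum(bushels_per_dollar.values()) / len(bushels_per_dollar)

# Harmonic mean of the prices: reciprocal of the mean bushels per dollar.
harmonic_price = 1 / mean_bushels

# Bushels obtainable for $1 at the mean price, a figure of different meaning.
bushels_at_mean_price = 1 / mean_price

print(f"mean price               : {mean_price:.4f}")
print(f"mean bushels per dollar  : {mean_bushels:.4f}")
print(f"harmonic mean of prices  : {harmonic_price:.4f}")
print(f"bushels at the mean price: {bushels_at_mean_price:.3f}")
```

The two reciprocal figures answer different questions: one averages what a dollar bought year by year, the other asks what a dollar would buy at the average price.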

Another illustration will serve to clarify further the use of 
the harmonic mean. The efficiency of a fighting airplane may
be determined, in part at least, by its speed, which can be 
expressed either as the number of miles flown per minute or the 
amount of time required to fly a mile. Following are the results 
of tests of a plane under trial: 

Results of Tests

Miles per minute: 6, 4, 7, 6, 5

Is the significant measure the rate at which a plane flies or 
the amount of time required to fly a number of miles? If it is 
admitted that the rate at which the plane flies is the important 
consideration (that is, the number of minutes required to fly a 
mile), the reciprocal of the harmonic mean is the relevant meas¬ 
ure, inasmuch as the recorded data make the time element con¬ 
stant and not the distance flown. The arithmetic mean, if 
calculated, would not be lacking in significance, but its reciprocal 
should not be compared with rate measures in which the number 
of miles is constant and the time varies. The average number of 
miles per minute is 28/5 = 5.6. The reciprocal of this number,
0.17857, is not the harmonic mean, and it is not the average time
that it takes to travel a mile. On the contrary, 0.17857 minute
is the amount of time it requires to go a mile when traveling at the
average number of miles per minute. The average amount of
time required to travel a mile is a different thing, namely, the average
of 1/6, 1/4, 1/7, 1/6, and 1/5 minutes, or 0.185 minute. While the two
results are close together in value, it would make a large difference
in calculations having to do with hours of time if the arithmetic 
mean were used when the harmonic mean ought to have been 
used. 
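The distinction can be made concrete by supposing, purely for illustration, that the plane flies one mile at each of the five recorded speeds. The total time for the five miles is then the sum of the reciprocals, and the constant speed that would cover the same five miles in the same time is the harmonic mean, not the arithmetic mean. The variable names below are introduced here.

```python
# Recorded speeds from the tests, miles per minute.
speeds = [6, 4, 7, 6, 5]

arithmetic_speed = sum(speeds) / len(speeds)

# Minutes needed to fly one mile at each recorded speed.
minutes_per_mile = [1 / s for s in speeds]
mean_minutes_per_mile = sum(minutes_per_mile) / len(minutes_per_mile)

# Harmonic mean of the speeds: reciprocal of the mean minutes per mile.
harmonic_speed = 1 / mean_minutes_per_mile

# Hypothetical flight: one mile at each speed, five miles in all.
total_minutes = sum(minutes_per_mile)
minutes_at_harmonic_speed = len(speeds) / harmonic_speed
minutes_at_arithmetic_speed = len(speeds) / arithmetic_speed

print(f"arithmetic mean speed   : {arithmetic_speed:.1f}")
print(f"harmonic mean speed     : {harmonic_speed:.4f}")
print(f"mean minutes per mile   : {mean_minutes_per_mile:.4f}")
print(f"minutes at mean speed   : {1 / arithmetic_speed:.5f}")
print(f"total minutes, 5 miles  : {total_minutes:.4f}")
```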

The Concept of an Average as a Summary Figure. The general
significance of an average as a summary figure may be illustrated. 
Suppose that information concerning the heights of mature males 
in Newark, N.J., is desired. Heights of all the mature males 
in Newark are therefore measured to the nearest i inch. The 
data collected will constitute complete information about the 
heights of mature males in Newark. But this knowledge, in 
untabulated form without summary figures, is not easy to com¬ 
prehend. It is necessary to analyze this total knowledge in 
some manner so that it may become more significant. The 
manner in which the analysis will proceed depends upon the 
object in mind; for example, an answer to any one of the following 
questions may give a more significant view: 

1. What height will coincide with the greatest number of 
recorded observations? The answer to this question is, of 
course, the mode. 

2. What is the height such that greater and smaller heights 
have been recorded with equal frequency? The answer to this 
question is the median. 

3. What is the height such that the sum of the squares of the 
differences between it and the recorded observations is a mini¬ 
mum? Or what is the height such that the algebraic sum of 
the differences between it and the recorded observations is zero? 
The answer to this question will be the arithmetic mean. 

4. What is the height H such that the product of the ratios of 
the recorded observations to H is unity? The answer to this 
question will be the geometric mean. 

5. If several rates of speed were given in miles per second, how 
many seconds on the average will be required to travel 1 mile? 
The answer is the reciprocal of the harmonic mean. (Heights 
could not be used as an illustration because the harmonic mean 
is significant only in the dual-variable type of quantity expres¬ 
sion, as explained above.) 

The term “average” is a generic term, and any one of these 
summary figures may be called an average; the decision as to 
which average should be used depends upon what question is to
be answered. If the median height is known and the height of a
man is greater than the median, it can be inferred that the man is 
taller than most men. If a man's height is equal to the mode, 
it is known that he has the most common, or usual, height. If 
height is analyzed in the abstract, as it might be in research on 
the effects of heredity and environment, the arithmetic mean is 
likely to be used, for such analyses ordinarily involve the solution 
of problems in mathematical terms. 

MOMENTS 

Definition. In physics, “moment” is a measure of a force with
respect to its tendency to produce rotation. The strength of
the tendency depends on the amount of force and the distance
from the origin of the point at which the force is exerted. If a
number of forces, $F_1, F_2, \ldots, F_n$, at distances $X_1, X_2, \ldots, X_n$,
are applied, the moment of the first force about the origin is
$F_1X_1$, the moment of the second force is $F_2X_2$, etc. These
moments are additive so that $\sum FX$ is the total moment about
the origin. If the total moment is divided by the total force,
the quotient is termed “a moment coefficient.” The formula is
$\sum FX/N$, where $N = \sum F$ is the total force.

It will be recognized that the formula for a moment coefficient 
is identical with that for an arithmetic mean. This identity 
has led statisticians to speak of the arithmetic mean as the
“first moment about the origin.” Technically the mean is a
moment coefficient and not a total moment, but in the case of
frequency curves, with which mathematical statistics is primarily
concerned, the total frequency N is generally taken as unity,1
so that the total moment and the moment coefficient are identical.
In any case, it has become customary in statistics to speak of
the mean $\bar{X} = \sum FX/N$ as the first moment about the origin, and
the distinction between total moment and moment coefficient is 
ignored. 

The concept of moments is also extended to higher powers. 
Thus in statistics $\sum FX^2/N$ is termed the “second moment
about the origin,” and $\sum FX^3/N$ is called the “third moment
about the origin,” etc. In general, the moments about zero
are as follows:

$\sum FX / N$
$\sum FX^2 / N$
. . . . . .
$\sum FX^n / N$

1 See pp. 276-277 and Appendix, Table VI.

When the moments are calculated, not from zero, but from
the mean, or “centroid,” they are defined as follows:

$\mu_1 = \sum Fx / N$
$\mu_2 = \sum Fx^2 / N$
$\mu_3 = \sum Fx^3 / N$
. . . . . .
$\mu_n = \sum Fx^n / N$

where $x = X - \bar{X}$. It will be noted that it is a convention in
statistics to use small x to represent deviations from the
arithmetic mean.

The first moment about the mean is the sum of the deviations 
from the mean multiplied by their respective frequencies and 
divided by the number of cases in the frequency distribution. 
The second moment is similarly obtained except that the devi¬ 
ations are squared before adding. To obtain the third moment 
the deviations are cubed, multiplied by their respective fre¬ 
quencies, added, and divided by the number of cases; to obtain 
the fourth moment the deviations are raised to the fourth power; 
to obtain the nth moment the deviations are raised to the nth 
power. This is indicated in the above equations. As already 
demonstrated above,1 the first moment about the mean is equal
to zero. In mechanics, the square root of the second moment
is the radius of gyration of a set of s equal particles, with respect
to a given centroidal axis.2
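The verbal description of the moments translates directly into a pair of small functions. The sketch below applies them to the frequency distribution used earlier for the arithmetic mean (X from 2 to 7, N = 30) and to a small symmetrical distribution made up here for contrast; the function names are illustrative.

```python
def moment_about_origin(values, freqs, k):
    """k-th moment about zero: (sum of F * X**k) / N."""
    n = sum(freqs)
    return sum(f * x ** k for x, f in zip(values, freqs)) / n


def moment_about_mean(values, freqs, k):
    """k-th moment about the mean: (sum of F * x**k) / N, with x = X - mean."""
    mean = moment_about_origin(values, freqs, 1)
    n = sum(freqs)
    return sum(f * (x - mean) ** k for x, f in zip(values, freqs)) / n


X = [2, 3, 4, 5, 6, 7]
F = [3, 3, 6, 9, 6, 3]

mean = moment_about_origin(X, F, 1)
central = [moment_about_mean(X, F, k) for k in range(5)]   # orders 0 through 4

# Hypothetical symmetrical distribution: its odd moments about the mean vanish.
symmetric_third = moment_about_mean([1, 2, 3], [1, 2, 1], 3)

print("mean:", mean)
print("moments about the mean, orders 0 to 4:", [round(m, 4) for m in central])
print("third moment of the symmetrical example:", symmetric_third)
```

The second moment about the mean can also be had from the moments about the origin, as the second moment about zero less the square of the first, which offers a check on the arithmetic.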

Purpose of Moments. For statistics the moments serve 
primarily as intermediary values. The moments about the 

1 See pp. 169-170.

2 Cf. Rietz, H. L., Mathematical Statistics (1936), p. 20.

mean are intermediary values useful for calculating measures
of variability, skewness, and other characteristics of the fre¬ 
quency distribution. Because of their great convenience in 
obtaining measures of the various characteristics of a frequency 
distribution, the calculation of the first four moments about the 
mean may well be made the first step in the analysis of a fre¬ 
quency distribution. This valuable feature will be illustrated 
in the ensuing pages and in the next chapter. Following are 
important generalizations concerning moments measured from 
the arithmetic mean for all frequency distributions, 

$\mu_0 = 1$
$\mu_1 = 0$                    (10)
$\mu_2 = \sigma^2$*

and in symmetrical distributions,

$\mu_3 = 0$
$\mu_5 = 0$                    (11)
$\mu_7 = 0$

for all “odd” numbered moments.

* Read “sigma square.”

VARIABILITY

It was indicated above that the chief interest of statistics is
in variability; summary figures such as averages are useful as
points of departure for further study of the frequency distribution.
It may be noted that the principle of averaging is fundamental
throughout; for all the various methods of summarizing,
whether it be central tendencies or variations from points of
central tendencies, use the principle of averaging as a method of
summarization or measurement.

The Range. The most obvious method of measuring variability
is to take the difference between the highest value and the
lowest value; this difference is called the “range.” Thus in a
set of several hundred grades, if the highest grade is 92, and the
lowest 13, the range is 92 - 13 = 79. The range is easily
understood and easy to compute, but it is dependent entirely on
the two extreme items. It is therefore seldom used as the
measure of variability when accuracy and stability of results are
desired. It is better to use some measure of variability that is
dependent on more than just two, if not all, of the cases.1

The Average Deviation, The average deviation (A.D.) is the 
arithmetic average of the variations of the data from their central 
tendency. This measure of variability may be ^illustrated as 
follows: 

Variable 
X 
2 

3 

4 
6 
8 
9 

10 


The deviations have to be added without regard to sign—other¬ 
wise, their sum is zero. In this example there are seven devia¬ 
tions (one of which is zero), and hence the average deviation 
is or 2.57. 

A.D.X = ^ (12) 

where x = X — X. 

The average deviation could be measured from the median or 
the mode, as well as from the arithmetic mean. In fact, it is 
usually measured from the median since it is least when so 
computed. 2 Let X — Mi = x\ Then the formula for the 
average deviation from the median is 

A.D.Mi = ^ (12a) 

Mi is used as a subscript to A.D. to indicate that the average 
deviation is measured from the median. It is to be noted that 
the average deviation is a measure of variability that, is based 
on all the observed cases. 

1 See, however. Smith and Duncan, Sampling Statistics, pp. 294r-296, for 
the use of the range in sampling analysis. 

* Cf. Kelley, Truman, Statistical Method (1923), p. 74. 
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mean are intermediary values useful for calculating measures 
of variability, skewness, and other characteristics of the frequency
distribution. Because of their great convenience in obtaining
measures of the various characteristics of a frequency distribution,
the calculation of the first four moments about the mean may well
be made the first step in the analysis of a frequency distribution.
This valuable feature will be illustrated in the ensuing pages and in
the next chapter. Following are important generalizations concerning
moments measured from the arithmetic mean. For all frequency
distributions,

\[ \mu_0 = 1, \qquad \mu_1 = 0, \qquad \mu_2 = \sigma^2\,^{*} \tag{10} \]

and in symmetrical distributions,

\[ \mu_3 = 0, \qquad \mu_5 = 0, \qquad \mu_7 = 0 \tag{11} \]

for all "odd" numbered moments.
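These generalizations can be checked numerically. The sketch below is only an illustration: the helper name `central_moment` and the small symmetrical distribution are assumptions chosen for the example, not part of the text's notation.

```python
def central_moment(xs, fs, k):
    """k-th moment about the arithmetic mean: sum(F * x**k) / N."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * (x - mean) ** k for x, f in zip(xs, fs)) / n


# An illustrative symmetrical distribution (values X with frequencies F).
xs = [1, 2, 3, 4, 5, 6, 7]
fs = [2, 3, 4, 5, 4, 3, 2]

moments = {k: central_moment(xs, fs, k) for k in range(8)}
for k, value in moments.items():
    print(f"mu_{k} = {value:.4f}")
```

The zeroth moment comes out as 1, the first as 0, and every odd moment vanishes because positive and negative deviations cancel in a symmetrical distribution.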

VARIABILITY 

It was indicated above that the chief interest of statistics is
in variability; summary figures such as averages are useful as
points of departure for further study of the frequency distribution.
It may be noted that the principle of averaging is fundamental
throughout; for all the various methods of summarizing, whether
of central tendencies or of variations from points of central
tendency, use the principle of averaging as a method of
summarization or measurement.

The Range. The most obvious method of measuring variability
is to take the difference between the highest value and the
lowest value; this difference is called the "range." Thus in a
set of several hundred grades, if the highest grade is 92 and the
lowest 13, the range is 92 - 13 = 79. The range is easily
understood and easy to compute, but it is dependent entirely on
the two extreme items. It is therefore seldom used as the
measure of variability when accuracy and stability of results are
desired. It is better to use some measure of variability that is
dependent on more than just two, if not all, of the cases.¹

* Read "sigma square."

The Average Deviation. The average deviation (A.D.) is the
arithmetic average of the variations of the data from their central
tendency. This measure of variability may be illustrated as
follows:

    Variable      Deviations from the Mean ($\bar{X} = 6$)
       X                        x
       2                       -4
       3                       -3
       4                       -2
       6                        0
       8                       +2
       9                       +3
      10                       +4
                            +9   -9
                         $\Sigma|x| = 18$

The deviations have to be added without regard to sign; otherwise,
their sum is zero. In this example there are seven deviations
(one of which is zero), and hence the average deviation
is 18/7, or 2.57.

\[ \mathrm{A.D.}_{\bar{X}} = \frac{\Sigma |x|}{N} \tag{12} \]

where $x = X - \bar{X}$.

The average deviation could be measured from the median or
the mode, as well as from the arithmetic mean. In fact, it is
usually measured from the median since it is least when so
computed.² Let $X - Mi = x'$. Then the formula for the
average deviation from the median is

\[ \mathrm{A.D.}_{Mi} = \frac{\Sigma |x'|}{N} \tag{12a} \]

Mi is used as a subscript to A.D. to indicate that the average
deviation is measured from the median. It is to be noted that
the average deviation is a measure of variability that is based
on all the observed cases.
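The computation can be sketched in a few lines of Python. The function name `average_deviation` and the second data set (`skewed`) are hypothetical, introduced only to show that the average deviation is smaller from the median than from the mean when the two differ.

```python
from statistics import mean, median


def average_deviation(values, center):
    """Arithmetic average of the deviations from `center`, signs ignored."""
    return sum(abs(v - center) for v in values) / len(values)


# The seven values used in the illustration above.
data = [2, 3, 4, 6, 8, 9, 10]
ad_mean = average_deviation(data, mean(data))
ad_median = average_deviation(data, median(data))
print(round(ad_mean, 2), round(ad_median, 2))

# A hypothetical lopsided set, where mean and median differ.
skewed = [1, 2, 3, 4, 20]
ad_skewed_mean = average_deviation(skewed, mean(skewed))
ad_skewed_median = average_deviation(skewed, median(skewed))
print(ad_skewed_mean, ad_skewed_median)
```

In the text's example the mean and the median are both 6, so the two average deviations coincide at 18/7.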


1 See, however, Smith and Duncan, Sampling Statistics, pp. 294-296, for
the use of the range in sampling analysis.

2 Cf. Kelley, Truman, Statistical Method (1923), p. 74.

In the foregoing example each X occurred only once, and so
the F's were all unity. When there is more than one of any X,
the formulas are

\[ \mathrm{A.D.}_{\bar{X}} = \frac{\Sigma F|x|}{N}, \qquad \mathrm{A.D.}_{Mi} = \frac{\Sigma F|x'|}{N} \]


Standard Deviation. The most generally used measure of
variability, however, is the standard deviation. This will be
readily understood when it is seen that the standard deviation
is easily treated mathematically; the average deviation has
very distinct limitations in this respect owing to the disregard
of plus and minus signs. In the case of the standard deviation
this defect is overcome by squaring the deviations before they
are averaged and then taking the square root of the average.
The symbol for the standard deviation is the small Greek letter $\sigma$,
read "sigma." By definition, the standard deviation is the
square root of the average of the squared deviations from the mean.
Symbolically,

\[ \sigma = \sqrt{\frac{\Sigma F x^2}{N}} \tag{13} \]

As may be seen by comparing this definition with the definition
of the moments about the mean [Eqs. (9)], the standard deviation
is the square root of the second moment; i.e.,

\[ \sigma = \mu_2^{1/2} \tag{13a} \]

Following is an illustration of the computation of the standard
deviation:


    Variable      Deviations from the      Deviations Squared
                  Mean ($\bar{X} = 5$)
       X                  x                       $x^2$
       1                 -4                        16
       2                 -3                         9
       3                 -2                         4
       4                 -1                         1
       5                  0                         0
       6                  1                         1
       7                  2                         4
       8                  3                         9
       9                  4                        16
                                          $\Sigma x^2 = 60$

In the above illustration the standard deviation is $(60/9)^{1/2}$, or
2.58. It will be noted here that the F's are all unity and therefore
do not enter into the calculations.
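The same computation may be sketched in code. The function below is an illustrative helper (its name and its optional `fs` argument for frequencies are assumptions); it follows the definition of the standard deviation as the square root of the second moment about the mean.

```python
import math


def standard_deviation(xs, fs=None):
    """Square root of the average squared deviation from the mean.

    `fs` holds the frequencies F; if omitted, every F is taken as unity.
    """
    if fs is None:
        fs = [1] * len(xs)
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    second_moment = sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / n
    return math.sqrt(second_moment)


sigma = standard_deviation([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(round(sigma, 2))  # 2.58
```

Supplying frequencies gives the weighted form; for instance, `standard_deviation([1, 2, 3], [1, 2, 1])` treats the value 2 as occurring twice.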

Variance. The square of the standard deviation is called the
"variance." Since $\sigma^2 = \Sigma F x^2 / N$, the variance is merely another
name for the second moment about the mean [see the definition
of this in Eqs. (9)]. The second moment is smaller when calculated
from the mean than when calculated from any other point.¹
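The last statement can be made plausible by a short derivation. This is a standard argument offered as a sketch, not necessarily the proof the authors give elsewhere; the symbols $A$ (an arbitrary origin) and $d = \bar{X} - A$ are introduced here for the purpose.

```latex
% Write X - A = (X - \bar{X}) + (\bar{X} - A) = x + d, and use \Sigma F x = 0.
\[
\frac{\Sigma F (X - A)^2}{N}
  = \frac{\Sigma F (x + d)^2}{N}
  = \frac{\Sigma F x^2}{N} + 2d\,\frac{\Sigma F x}{N} + d^2
  = \sigma^2 + d^2 .
\]
```

Since $d^2$ cannot be negative, the second moment about any origin $A$ is at least $\sigma^2$, with equality only when $A = \bar{X}$.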

SKEWNESS 

Definition and Significance. "Skewness" means asymmetry.
Frequency distributions that have extreme variations resulting
in a longer tail in one direction from the peak than in the other
direction from the peak are asymmetrical in appearance. Such
distributions are called "skewed distributions." Up to this
point the discussion has concerned methods for measuring
central tendencies (averages) and methods for measuring variability
(average deviation and standard deviation). The
significance of an average may be considerably modified when
considered in comparison with the average deviation or standard
deviation; but, in addition, the significance of the averages
depends upon the symmetry or lack of symmetry in the distribution
of individual cases. Measures of skewness are accordingly
desirable.

Skewness Measured by Relationship between the Mean, the
Median, and the Mode. Skewness is most easily measured by
the relationship between the mean, the median, and the mode;
for it will be recalled that the mode is not affected by the
magnitudes or number of variations above or below it. The median
is affected by the number of variations, but not by their
magnitudes. The mean is affected by both the number and the
magnitudes of the variations from it. Consequently, it would be
expected that

1. The mode, by definition, will remain at the point of greatest
frequency, whether or not the distribution is skewed.

2. The median will be pulled away from the mode in distributions
that are skewed, since the larger number of items on
one side pull it from the point of greatest frequency.

3. The mean in such distributions will be pulled away from the
mode still farther since the larger number and extreme magnitude
of the items on one side pull it farther away from the point of
greatest frequency.

1 Cf., for proof, p. 215.

These points are illustrated in the following frequency distributions:¹

    1. Symmetrical     2. Skewed Positively     3. Skewed Negatively
       X      F            X      F                 X      F
       1      3            1      6                 1      1
       2      4            2      8                 2      2
       3      6            3      9                 3      3
       4      9            4      8                 4      5
       5     10            5      7                 5      7
       6      9            6      5                 6      9
       7      6            7      3                 7     10
       8      4            8      2                 8      8
       9      3            9      1                 9      6
             54                  49                       51

                       (1)         (2)         (3)
    $\bar{X}$  =        5          3.92        6.10
    Mi         =        5          3.69        6.33
    Mo         =        5          3           7
    $\sigma$   =        2.08       2.05        2.01


Figures 64 to 66 show in graphic form the frequency distributions
given on this page. The relationship between the
three averages will be more clearly visualized from these figures.

In Fig. 64, for example, all three averages equal 5. In Fig. 65,
which is the positively skewed frequency distribution, the three
differ from each other; the mode is 3, the median is 3.69, and the
mean is 3.92. The extreme variations toward the higher values
give the frequency distribution a longer tail to the right, or
toward the higher values of X, and this pulls the median and the
mean in that direction from the mode.

In Fig. 66 a negatively skewed frequency distribution is
illustrated. This histogram is a graph of the third set of figures
shown on this page. In this figure, too, the averages differ
from each other, but here the mode is the largest. The extreme
variations toward the lower values give the frequency distribution
a longer tail to the left, or toward the lower values of X,
and this pulls the median and the mean in that direction from
the mode; so while the mode is 7, the median is 6.33 and the
mean is 6.10.

1 Calculations of averages were made on the assumption that the X values
are mid-points of class intervals of unity.
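The three averages quoted for each distribution can be reproduced with a short script. It is a sketch under the assumption stated in the footnote (each X is the mid-point of a class interval of unity); the helper names are hypothetical, and the median is found by straight-line interpolation within the class containing the middle case.

```python
def grouped_mean(xs, fs):
    """Arithmetic mean: sum(F * X) / N."""
    return sum(x * f for x, f in zip(xs, fs)) / sum(fs)


def grouped_median(xs, fs, width=1.0):
    """Median by interpolation, each X being the mid-point of a class."""
    half = sum(fs) / 2
    cumulative = 0
    for x, f in zip(xs, fs):
        if cumulative + f >= half:
            lower_limit = x - width / 2
            return lower_limit + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")


def mode(xs, fs):
    """Value of X with the greatest frequency."""
    return xs[fs.index(max(fs))]


xs = list(range(1, 10))
distributions = {
    "symmetrical": [3, 4, 6, 9, 10, 9, 6, 4, 3],
    "positive": [6, 8, 9, 8, 7, 5, 3, 2, 1],
    "negative": [1, 2, 3, 5, 7, 9, 10, 8, 6],
}
summary = {
    name: (grouped_mean(xs, fs), grouped_median(xs, fs), mode(xs, fs))
    for name, fs in distributions.items()
}
for name, (m, med, mo) in summary.items():
    print(f"{name}: mean={m:.2f} median={med:.2f} mode={mo}")
```

In each skewed case the median lies between the mode and the mean, as the three numbered points above lead one to expect.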



Fig. 64.—A symmetrical frequency distribution. 



Fig. 65.—A positively skewed frequency distribution. 


Skewness may accordingly be measured by the difference
between the mean and the mode. For the above examples,

\[ \bar{X} - Mo = 0, \qquad \bar{X} - Mo = +0.92, \qquad \bar{X} - Mo = -0.90 \]

This is a measure of the aggregate amount of skewness, but
how significant is this amount? One device to weigh this is to
compare the aggregate amount of skewness with the standard
deviation and thereby to obtain a coefficient of skewness, as
follows:

\[ sk = \frac{\bar{X} - Mo}{\sigma} \tag{14} \]

For example (2), $sk = 0.92/2.05 = 0.45$.

For example (3), $sk = -0.90/2.01 = -0.45$.

The relative amount of skewness, or asymmetry, in these two
distributions comes out equal, although one is positive and the
other is negative.



Fig. 66.—A negatively skewed frequency distribution. 


Another measure of skewness is based on the median and
the mean. It has been established that in a moderately asymmetrical
distribution, if the mean is pulled a distance P away
from the mode, the median is pulled approximately two-thirds
P away from the mode in the same direction; that is,

\[ \bar{X} - Mo = 3(\bar{X} - Mi) \]

Hence, skewness can also be measured by three times the distance
between the mean and the median, as follows:

\[ sk = \frac{3(\bar{X} - Mi)}{\sigma} \tag{14a} \]

The second equation has the advantage over (14) in that the
median is often easier to locate than the mode. The mode is
frequently difficult to locate in a sample distribution and is
subject to wide fluctuations of sampling; in addition, its location
is often dependent merely upon the selection of the class interval.
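The two coefficients may be compared on the skewed examples. The sketch below takes the rounded averages and standard deviations quoted earlier as its inputs; the function names are assumptions made for the illustration.

```python
def skewness_from_mode(mean, mode, sigma):
    """Coefficient (14): (mean - mode) / sigma."""
    return (mean - mode) / sigma


def skewness_from_median(mean, median, sigma):
    """Coefficient (14a): 3 * (mean - median) / sigma."""
    return 3 * (mean - median) / sigma


examples = {
    "positive": dict(mean=3.92, median=3.69, mode=3, sigma=2.05),
    "negative": dict(mean=6.10, median=6.33, mode=7, sigma=2.01),
}
for name, e in examples.items():
    by_mode = skewness_from_mode(e["mean"], e["mode"], e["sigma"])
    by_median = skewness_from_median(e["mean"], e["median"], e["sigma"])
    print(f"{name}: (14) = {by_mode:+.2f}, (14a) = {by_median:+.2f}")
```

For these small examples the median-based coefficient comes out near 0.34 in absolute value against 0.45 for the mode-based one; the relation between the two is only approximate, and it is stated for moderately asymmetrical distributions.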

Skewness Measured by the Medians and Quartiles. A simple
measure of skewness, and one that is easily comprehended, is
obtained by comparing the location in a frequency distribution of
its median and quartiles. This can be illustrated by taking the
same three distributions used above and calculating their
respective first and third quartiles (the medians are already
calculated). From Fig. 67 it is seen that in the symmetrical
distribution the first quartile is smaller than the median by the
same amount that the third quartile is larger than the median;
$Q_1$ and $Q_3$ are equidistant from the median. Accordingly,

\[ Q_3 - Mi - (Mi - Q_1) = 0 \]

If the terms are rearranged, this may be written

\[ Q_1 + Q_3 - 2Mi = 0 \]

Fig. 67.—Relation between the first and third quartiles and the median of a
symmetrical distribution.

From Fig. 68, the positively skewed case, it is seen that the
first quartile is smaller than the median by an amount considerably
less than the amount by which the third quartile is
larger than the median; accordingly,

\[ Q_3 - Mi - (Mi - Q_1) = +0.222 \]

From Fig. 69, the negatively skewed distribution, it is seen
that the first quartile is smaller than the median by an amount
considerably larger than the amount by which the third quartile is
larger than the median; for $Q_3 - Mi - (Mi - Q_1) = -0.154$.




Fig. 68.—Relation between the first and third quartiles and the median of a 
positively skewed distribution. 



Fig. 69.—Relation between the first and third quartiles and the median of a 
negatively skewed distribution. 

If the location of the quartiles compared with the location
of the median, in each distribution, is now compared with half
the distance between the two quartiles (i.e., the average distance
of the two quartiles from the median), a coefficient or
relative measure of skewness is obtained, and the formula that
measures skewness is as follows:

\[ sk = \frac{Q_3 - Mi - (Mi - Q_1)}{\dfrac{Q_3 - Q_1}{2}} = \frac{Q_3 + Q_1 - 2Mi}{Q} \tag{15} \]

where Q is used as a symbol signifying the semiquartile range.

Not only is the coefficient of skewness based upon the median
and quartiles a good one because these are usually easy to find,
but also this coefficient of skewness has the advantage that it
has value limits between +2 and -2. It can be no greater
than +2, a value reached when positive skewness is so great that the
median equals the first quartile. It can be no less than -2,
a value reached when negative skewness is so great that the median
equals the third quartile. In Figs. 68 and 69, respectively,
the coefficients of skewness are +0.146 and -0.106 expressed
as ratios. Expressed as percentages of skewness, they are
+14.6 per cent and -10.6 per cent.
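The limits of this coefficient can be verified directly from the formula. The function name `quartile_skewness` and the quartile values passed to it are hypothetical, chosen only to exercise the symmetrical case and the two extremes.

```python
def quartile_skewness(q1, median, q3):
    """(Q3 + Q1 - 2 * Mi) divided by the semiquartile range (Q3 - Q1) / 2."""
    semiquartile_range = (q3 - q1) / 2
    return (q3 + q1 - 2 * median) / semiquartile_range


symmetrical = quartile_skewness(2, 5, 8)   # quartiles equidistant from median
upper_limit = quartile_skewness(2, 2, 8)   # median equal to first quartile
lower_limit = quartile_skewness(2, 8, 8)   # median equal to third quartile
print(symmetrical, upper_limit, lower_limit)
```

Because the median cannot lie outside the two quartiles, no other placement of it gives a value beyond these two extremes.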

Third Moment as a Measure of Skewness. The cube root of
the third moment is also a good measure of skewness. This is
due to the fact that (1) if the distribution is symmetrical, $\mu_3$
will be zero; but (2) if the distribution is not symmetrical, the
third moment will not be equal to zero but will have a positive
or negative value according to whether the distribution is
skewed positively or negatively. This is illustrated by simple
examples showing a symmetrical and a positively skewed distribution.
The negatively skewed distribution is left to the
student to work out.


1. Symmetrical Distribution

     X      F      x      Fx     $Fx^2$    $Fx^3$
     1      2     -3      -6       18       -54
     2      3     -2      -6       12       -24
     3      4     -1      -4        4        -4
     4      5      0       0        0         0
     5      4      1       4        4         4
     6      3      2       6       12        24
     7      2      3       6       18        54
           23         $\Sigma Fx = 0$   $\Sigma Fx^2 = 68$   $\Sigma Fx^3 = 0$

2. Positively Skewed Distribution

     X      F      x      Fx     $Fx^2$    $Fx^3$
     1      6     -2     -12       24       -48
     2      8     -1      -8        8        -8
     3     10      0       0        0         0
     4      4      1       4        4         4
     5      3      2       6       12        24
     6      2      3       6       18        54
     7      1      4       4       16        64
           34         $\Sigma Fx = 0$   $\Sigma Fx^2 = 82$   $\Sigma Fx^3 = 90$


In the second example, the third moment is equal, not to
zero, but to 90/34. The measure of skewness by this method
would be $\sqrt[3]{90/34}$. Symbolically, this measure of skewness is

\[ sk = \sqrt[3]{\frac{\Sigma F x^3}{N}} = \sqrt[3]{\mu_3} \tag{16} \]

It may be seen from the definition of the third moment [Eqs.
(9)] that this measurement of skewness is the cube root of the
third moment. Expressed as a coefficient of skewness, where
the aggregate amount of skewness is in terms of the standard
deviation, this measure of skewness is as follows:

\[ sk = \frac{\sqrt[3]{\mu_3}}{\sigma} \]

The Beta Coefficients. This last measure of skewness is of
particular interest, not only because it is based on a wholly
mathematical procedure (it is not dependent on nonmathematical
summaries like the median and quartiles or the mode),
but also because it is directly related to one of the so-called
"beta coefficients." The beta coefficients are functions of the
moments of the frequency distribution that have been found
very useful in describing and distinguishing various types of
frequency distributions.¹ The two principal beta coefficients
are $\beta_1$ and $\beta_2$, which are defined as follows:

\[ \beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} \]

It will be noted that the sixth root of $\beta_1$ is identically the
coefficient of skewness, $sk = \sqrt[3]{\mu_3}/\sigma$, for $\mu_2 = \sigma^2$.
Frequently $\sqrt{\beta_1}$ is itself taken as a measure of skewness, the
square root being given the same sign as $\Sigma Fx^3$, from which the
third moment is calculated.

1 Smith and Duncan, Sampling Statistics,

When the beta coefficients are used to describe a frequency
curve according to a formula developed by Karl Pearson,¹ the
skewness for this curve as measured by $(\bar{X} - \text{mode})/\sigma$ is found to be
equal to

\[ \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)} \]

When the data are such as to warrant the fitting of a smooth
frequency curve, this is an excellent formula for measuring
skewness. The curve, of course, does not have to be fitted to
make use of the formula as a measure of skewness.

When $\beta_1$ is small, i.e., when the skewness is slight, and when $\beta_2$
is approximately equal to 3, as it is in the case of a normal
distribution,² this last equation shows that $\sqrt{\beta_1}$ is approximately
equal to twice $(\bar{X} - \text{mode})/\sigma$, the mode being that of the fitted
Pearsonian curve. When the latter is calculated by the approximate
formula $Mo = \bar{X} - 3(\bar{X} - Mi)$, the calculation of half
the square root of $\beta_1$ will in certain cases serve as a rough check
on the computation of skewness from the formula $sk = 3(\bar{X} - Mi)/\sigma$.

The importance of the second beta coefficient lies in the fact
that it is a measure of kurtosis.
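The relations among the beta coefficients can be illustrated numerically. The sketch assumes the usual definitions (beta-one as the squared third moment over the cubed second moment, beta-two as the fourth moment over the squared second moment) and the standard form of Pearson's expression for (mean - mode)/sigma; the function names and the trial values 0.01 and 3.0 are hypothetical.

```python
import math


def central_moment(xs, fs, k):
    """k-th moment about the arithmetic mean: sum(F * x**k) / N."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * (x - mean) ** k for x, f in zip(xs, fs)) / n


def beta_coefficients(xs, fs):
    """Return (beta_1, beta_2) computed from the central moments."""
    m2 = central_moment(xs, fs, 2)
    m3 = central_moment(xs, fs, 3)
    m4 = central_moment(xs, fs, 4)
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2


def pearson_skewness(beta1, beta2):
    """Pearson's expression for (mean - mode) / sigma; sign to be supplied."""
    return math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))


# A small positively skewed illustration (values X with frequencies F).
xs = [1, 2, 3, 4, 5, 6, 7]
fs = [6, 8, 10, 4, 3, 2, 1]
beta1, beta2 = beta_coefficients(xs, fs)
m2 = central_moment(xs, fs, 2)
m3 = central_moment(xs, fs, 3)
coefficient = m3 ** (1 / 3) / math.sqrt(m2)   # m3 is positive here
print(round(beta1, 4), round(beta2, 4), round(beta1 ** (1 / 6), 4))
print(round(pearson_skewness(0.01, 3.0), 4), math.sqrt(0.01) / 2)
```

The first line of output shows the sixth root of beta-one agreeing with the third-moment coefficient of skewness; the second shows Pearson's expression falling close to half the square root of beta-one when beta-one is small and beta-two equals 3.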


KURTOSIS 

Definition. Kurtosis is described by Karl Pearson as follows:³

Given two frequency distributions which have the same variability
as measured by the standard deviation, they may be relatively more or
less flat-topped than the normal curve. ... A frequency distribution
may, in other words, be symmetrical, but it may fail to be mesokurtic
(equally flat-topped with the normal curve), and thus the Gaussian
curve cannot describe it.

1 Cf. Smith and Duncan, Sampling Statistics, pp. 134-137.

2 See the next section.

3 "Skew Variation, A Rejoinder," Biometrika, Vol. 4 (1906), p. 173.
Cited from H. M. Walker, Studies in the History of Statistical Method, p. 182.

The "normal curve" to which this quotation refers is represented
by the equation

\[ Y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2} \tag{18} \]

Because this curve has arisen so frequently in statistics and
because it has been used as a type with which to compare other
frequency curves, it has come to be known as the normal curve.
Also, since Gauss early recognized its importance, it is sometimes
called the Gaussian curve.¹

As shown in graphic form earlier in this chapter (Fig. 63),
when there is a marked concentration of very small variations
about the central tendency, the frequency curve rises to a high
peak, unlike the normal, or Gaussian, curve, which has a certain
roundness at the top. Kurtosis is a measure that makes it
possible to describe the relative degree to which this characteristic
exists with reference to any frequency distribution. For
the normal curve, the relationship between the second and the
fourth moments is as follows:

\[ \beta_2 = \frac{\mu_4}{\mu_2^2} = 3 \]

1 Cf. pp. 294-295.
If the ratio of the fourth moment to the square of the second
moment is less than 3, the curve is flatter than the normal curve;
and if the ratio is greater than 3, the curve is more peaked than
the normal curve. Figure 70 shows three frequency distributions,
with $\beta_2$ equal, respectively, to 2, 3, and 4; the standard
deviations of the three curves are equal. It should be noted
that the smaller area at the peak of the flat-topped distribution
is accompanied by a loss of area at the tails of the distribution.
This loss of area at the peak and the tails is compensated by the
fact that the curve is higher than the normal curve on each side
between the peak and the tails. In other words, it is flat both
at the peak and at the tails and is high in the shoulders. In
contrast, the more peaked frequency curve is higher than the
normal curve at both the peak and the tails and lower than the
normal curve on each side between the peak and the tails. Its
shoulders are lower than those of the normal curve.

Table 10.—Computation of Three Frequency Curves¹
$\beta_2 = 2$, $\beta_2 = 3$, $\beta_2 = 4$

    $x/\sigma$   $\phi_0$          $\phi_4$    $\frac{1}{24}\phi_4$   $-\frac{1}{24}\phi_4$   $\beta_2 = 4$   $\beta_2 = 2$
                 ($\beta_2 = 3$)
       (1)         (2)         (3)         (4)         (5)         (6)         (7)
       0.0       0.3989      1.1968      0.0499     -0.0499      0.4488      0.3490
       0.2       0.3910      1.0799      0.0450     -0.0450      0.4360      0.3460
       0.4       0.3683      0.7607      0.0317     -0.0317      0.4000      0.3366
       0.6       0.3332      0.3231      0.0135     -0.0135      0.3467      0.3197
       0.8       0.2897     -0.1247     -0.0052      0.0052      0.2845      0.2949
       1.0       0.2420     -0.4839     -0.0202      0.0202      0.2218      0.2622
       1.2       0.1942     -0.6925     -0.0288      0.0288      0.1654      0.2230
       1.4       0.1497     -0.7364     -0.0307      0.0307      0.1190      0.1804
       1.6       0.1109     -0.6440     -0.0268      0.0268      0.0841      0.1377
       1.8       0.0790     -0.4692     -0.0195      0.0195      0.0595      0.0985
       2.0       0.0540     -0.2700     -0.0112      0.0112      0.0428      0.0652
       2.2       0.0355     -0.0927     -0.0039      0.0039      0.0316      0.0394
       2.4       0.0224      0.0362      0.0015     -0.0015      0.0239      0.0209
       2.6       0.0136      0.1105      0.0046     -0.0046      0.0182      0.0090
       2.8       0.0079      0.1379      0.0057     -0.0057      0.0136      0.0022

1 Columns (2), (6), and (7) give the ordinates of the three curves. Column (6) = (2) +
(4), and column (7) = (2) + (5), account being taken of the signs. The figures are derived
from the formula for a Gram-Charlier frequency curve. See Smith and Duncan, Sampling
Statistics, pp. 84, 142-144.
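The ordinates of Table 10 can be regenerated with a few lines of code. The sketch assumes the form of the Gram-Charlier curve implied by the table's footnote, namely the normal ordinate plus (beta-two minus 3)/24 times its fourth derivative; the function names are hypothetical.

```python
import math


def phi0(t):
    """Ordinate of the normal curve in standard units."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)


def phi4(t):
    """Fourth derivative of phi0: (t^4 - 6 t^2 + 3) * phi0(t)."""
    return (t ** 4 - 6 * t ** 2 + 3) * phi0(t)


def ordinate(t, beta2):
    """Gram-Charlier ordinate with kurtosis beta2 and no skewness term."""
    return phi0(t) + (beta2 - 3) / 24 * phi4(t)


for step in range(0, 15):
    t = step * 0.2
    print(f"{t:.1f}  {phi0(t):.4f}  {phi4(t):+.4f}  "
          f"{ordinate(t, 4):.4f}  {ordinate(t, 2):.4f}")
```

Small differences in the last decimal place against the printed table are to be expected, since the table adds figures that had already been rounded.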

SUMMARY 

There are two ways in which frequency distributions differ
from the so-called "normal" frequency distribution.¹ (1) Frequency
distributions may have a higher or lower peak than the
normal frequency distribution. This relative flatness or lack of
flatness of the peak relative to the normal curve is called
"kurtosis." (2) Frequency distributions may have a preponderance of
variability to the large values or to the small values. This lack
of symmetry in variability is called skewness. The normal
distribution and the concepts connected with its analysis constitute
a convenient point of departure for the general analysis
of variability. In this study of variability, the characteristics of
kurtosis and skewness are of great importance for the reason that
a large part of the phenomena studied have characteristics producing
frequency distributions that are not "normal." Even as
early as the time when Sir Francis Galton was developing his
theory of correlation (1877-1889), writers on mathematical
statistics realized that the univariate normal law of De Moivre
and Laplace could not be regarded as a universal law of frequency
distribution; the presence of skewness in homogeneous
material was certainly as common as that of normality.²

It is important to realize that the function of frequency-distribution
analysis is not primarily to define and measure
averages but to define, describe, and measure variability.
Simple averages have relatively limited uses and may lead to
misinterpretation rather than clarification if used without reference
to the measures of variability, skewness, and kurtosis.

1 For further description of the normal curve, see Chaps. X and XI.

2 Pretorius, S. J., Biometrika, Vol. 22 (1930-1931), pp. 109-223. Cf.
Elderton, W. Palin, Frequency Curves and Correlation (1927). Rietz,
H. L., Handbook of Mathematical Statistics, Chap. VII, Frequency Curves,
pp. 92-119, by H. C. Carver, and pp. 288-239 by W. A. Shewhart.

In this chapter some of the most generally used methods of
analysis of frequency distributions have been presented in
elementary form, in order to keep clear of the complications of
practical application. It will now be easier to see how these
methods are applied and just what complications enter into their
application to real problems. The following chapter gives an
example of such an analysis, using data from real life. But it will
be well to close this chapter with a summary of the symbols that
have thus far been used and that include by far the majority of
all the symbols used in statistics. Most of the symbolic language
can be learned from this chapter. If these symbols are mastered, the few
additional ones will be easy to learn.

Summary of Symbols

A variable                        $X_1, X_2, X_3, \ldots, X_n$, or in general $X$
Frequencies                       $F_1, F_2, F_3, \ldots, F_n$, or in general $F$
Sum of                            $\Sigma$ (Greek capital sigma)
Number of cases                   $N$ (equals $\Sigma F$)
Arithmetic mean                   $\bar{X}$
Deviations from the mean          $x_1, x_2, \ldots, x_n$, or in general $x$ (equals $X - \bar{X}$)
Median                            Mi
First quartile                    $Q_1$
Third quartile                    $Q_3$
Mode                              Mo
Deviations from the median        $x'_1, x'_2, \ldots, x'_n$, or in general $x'$ (equals $X - Mi$)
Geometric mean                    G.M.
Harmonic mean                     H.M.
Average deviation                 A.D.
Standard deviation                $\sigma$ (Greek small sigma)
Chi square                        $\chi^2$ (Greek small chi)
Skewness                          sk
Moments:
  a. About arbitrary origin       $\nu_1, \nu_2, \nu_3, \ldots, \nu_n$
  b. About the arithmetic mean    $\mu_1, \mu_2, \mu_3, \ldots, \mu_n$
Kurtosis                          $\beta_2$ (Greek small beta)

Following is a summary of the formulas that have been used in
this chapter:

Summary of Formulas

SFZ - F 

(1) X = ^ also 

(2) $\Sigma Fx = 0$

(3) $Mo = \bar{X} - 3(\bar{X} - Mi)$

(4) log G.M. = 

( Xl w X 2 _ Xn 1 

^ G.M. ^ G.M. ^ ^ G.M". 

(6) H.M. = ^ 

(7) $\text{H.M.} < \text{G.M.} < \bar{X}$

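(8) $\nu_1 = \dfrac{\Sigma FX}{N}, \quad \nu_2 = \dfrac{\Sigma FX^2}{N}, \quad \nu_n = \dfrac{\Sigma FX^n}{N}$

(9) $\mu_1 = \dfrac{\Sigma Fx}{N}, \quad \mu_2 = \dfrac{\Sigma Fx^2}{N}, \quad \mu_n = \dfrac{\Sigma Fx^n}{N}$

ILLUSTRATION OF FREQUENCY-DISTRIBUTION
ANALYSIS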

Data for a Frequency Distribution. Data selected to illustrate
frequency-distribution analysis are presented in Table 11.
Heights in inches of 300 members of the freshman class of 1943
were obtained from the records of the Department of Health and
Physical Education, Princeton University. As presented in the
table the data are not arranged in a frequency distribution;
they are listed at random. In order to make a frequency
distribution of the data it is necessary first to decide on the size and
limits of the class interval to be used in the construction of the
distribution; for the frequency distribution per se is a method of
summarization compared with the manner of presentation of the
data in Table 11.

CONSTRUCTION OF A FREQUENCY DISTRIBUTION 

The Class Interval. Rules for Determining the Class Interval.
The class interval is the unit of the frequency distribution; in
other words, it is the size of the groups in which the data are
summarized. In the data selected for illustration should the
groups be 1-inch, ½-inch, ¼-inch, or 1/10-inch groups? That is,
should the class interval be 1 inch, a half inch, a quarter inch, or a
tenth of an inch?

A general rule for selecting the class interval is that it should be 
such as to make possible, without serious error, the treatment of 
all values assigned to any one of the classes as if they were equal 
to the mid-point or mid-value of the class. The lower limit of the 
class intervals also should be so selected as to facilitate this end. 
If the cases are concentrated, in fact, at the mid-point of the class 
interval or are evenly distributed throughout, it may, without 
serious errors in calculation, be assumed that all cases in the class 
are equal in value to the mid-value. 
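The grouping of raw measurements into classes may be sketched as follows. The function name, the lower limit of 61 inches, and the 1-inch interval are hypothetical choices made only for illustration, and the ten heights are simply the first ten entries of Table 11; a value falling exactly on a limit is assigned here to the class of which it is the lower limit.

```python
import math
from collections import Counter


def frequency_distribution(values, lower_limit, width):
    """Tally values into classes; return {class mid-value: frequency}."""
    counts = Counter()
    for v in values:
        # Index of the class [lower + k * width, lower + (k + 1) * width).
        k = math.floor((v - lower_limit) / width)
        counts[k] += 1
    return {lower_limit + (k + 0.5) * width: counts[k] for k in sorted(counts)}


# First ten heights listed in Table 11 (inches).
heights = [71.00, 76.50, 70.25, 68.75, 74.40,
           70.50, 74.00, 71.50, 69.00, 71.00]
dist = frequency_distribution(heights, lower_limit=61.0, width=1.0)
for mid_value, frequency in dist.items():
    print(f"{mid_value:5.1f}  {frequency}")
```

Once the data are tallied in this way, every case in a class is treated in later calculations as if it stood at the mid-value of that class, which is why the choice of limits matters.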

Another guide in the selection of the class interval is that it
should be as large as possible subject to the first condition and
to the condition that the interval should not be so large as to
conceal too much of the character of the variability. Indeed, the
most important purpose of the class interval is so to summarize

Table 11. —Heights, 300 Eighteen-year-old Members of the Class of 
1943, Princeton University 
(In inches) 


71.00 

76.50 

70.25 

68.75 

74.40 

70.50 

74.00 

71.50 

69.00 

71.00 

69.50 

73.00 

67.50 

70.50 

70.00 

67.50 

71.25 

70.00 

69.80 

66.75 

68.50 

67.50 

75.00 

72.80 

70.75 

63.00 

72.00 

72.50 

72.50 

69.00 

65.75 

67.00 

70.25 

69.00 

71.25 

70.50 

70.50 

68.25 

69.70 

66.00 

69.75 

69.00 

72.00 

70.30 

72.00 

73.20 

69.00 

71.75 

72.00 

72.25 

72.60 

68.50 

71.20 

70.50 

67.50 

74.40 

67.25 

72.80 

66.50 

68.00 

70.50 

70.50 

70.70 

71.00 

71.50 

69.70 

73.50 

68.00 

67.50 

71.75 

72.25 

70.75 

70.50 

67.75 

70.50 

72.00 

67.25 

66.50 

75.75 

66.90 

73.00 

69.50 

68.25 

70.75 

73.75 

65.50 

70.75 

69.20 

73.25 

70.75 

73.00 

69.70 

70.00 

72.50 

71.75 

71.60 

72.50 

70.80 

75.75 

70.50 

75.50 

75.75 

70.75 

71.00 

69.75 

71.50 

72.75 

69.20 

70.25 

69.50 

71.00 

67.50 

72.00 

67.50 

73.20 

73.75 

68.50 

70.00 

77.25 

67.00 

68.75 

71.30 

73.50 

72.00 

71.00 

69.00 

71.00 

69.50 

74.10 

70.50 

70.75 

72.00 

66.50 

65.75 

73.00 

66.25

71.00 

70.75 

68.20 

67.75 

75.25 

71.70 

69.50 

67.75 

73.50 

74.20 

72.50 

67.25 

67.50 

70.75 

70.75 

67.00 

72.00 

67.00 

69.50 

69.25 

67.50 

69.30 

74.30 

67.00 

70.40 

70.25 

67.00 

66.50 

72.00 

70.25 

68.70 

71.75 

70.10 

69.50 

67.25 

66.75 

72.50 

71.30 

66.50 

69.20 

69.75 

68.50 

69.75 

68.75 

70.10 

74.00 

68.00 

69.00 

72.50 

73.75 

70.00 

68.10 

68.25 

71.00 

74.50 

72.75 

71.40 

71.25 

73.00 

71.50 

69.00 

71.50 

68.50 

72.80 

72.00 

71.00 

71.50 

62.00 

68.00 

74.75 

69.25 

73.25 

64.25 

75.25 

70.00 

70.50 

68.25 

67.90 

70.75 

70.75 

72.50 

74.00 

71.00 

72.50 

69.50 

72.00 

71.20 

69.50 

67.00 

72.00 

71.50 

68.25 

69.00 

72.00 

69.50 

73.50 

71.30 

70.70 

67.50 

63.60 

72.50 

72.50 

70.50 

71.30 

69.75 

68.50 

71.00 

71.00

71.00 

71.25 

67.50 

69.75 

69.25 

66.50 

69.00 

68.50 

73.00 

68.00 

69.50 

69.25 

74.00 

67.50 

67.00 

68.50 

73.25 

68.50 

74.25 

73.70 

72.00 

69.00 

68.00 

71.00 

68.80 

70.00 

69.50 

71.75 

68.50 

70.00 

69.00 

67.00 

68.50 

71.75 

73.00 

69.75 

75.20 

74.00 

69.50 

70.00 




72.80 

67.75 

69.30 

70.00 




69.40 

70.50 

66.75 

73.40 




68.50 

68.00 

76.25 

67.80 




68.50 

74.75 

71.75 

65.00

the data in a frequency distribution as to disclose more clearly
the character of the variability. If a very small class interval is
chosen, the character of the variability will not be visible unless a
very large number of cases are measured; if a very large class
interval is chosen, significant irregularities in the data may be
concealed.

Ordinarily, the size of the class interval should be uniform
throughout, because different-sized class intervals will complicate
calculations. In some cases, however, it is necessary to
use different-sized class intervals in order to give a proper picture
of variability.

If other more important rules are not thereby violated, in the 
interests of simplicity the position of the class interval in the 
range should be such that the limits of the intervals are integers 
or such that the mid-values of the class intervals are integers. 
Where marked concentration about certain values exists, as is 
sometimes the case in dealing with discrete data, these values 
should so far as possible be made the mid-points of class intervals.

An Array of the Data. Intelligent determination of the class
interval is aided by study of the data arranged in an array or
scatter diagram such as Fig. 71, which is presented to illustrate
the determination of the proper class interval. In the figure,
the heights shown in Table 11 are arranged in an array. Because
inspection of the data in Table 11 led to the suspicion that
concentration points were present at the ¼-, ½-, and ¾-inch values,
the array is presented in rows with these concentration points
plumbed. Summing the columns and inspecting the detail of the
scatter diagram both show the concentration of frequencies at
these values.

Frequency Distribution with Too Many Class Intervals. Examination
of Fig. 71 suggests that a ¼-inch class interval beginning
at 61.875 inches, as shown in Table 12, might be a good class
interval for the data of Table 11, for the ¼-inch class interval
with the lower limits as shown in Table 12 places the mid-values
of class intervals at points of concentration. Such a frequency
distribution contains over 60 rows, however, and, in addition,
is uneven and irregular in appearance. Ten frequencies occur
in the interval 66.875-; only five frequencies occur in the interval
68.625-; twelve frequencies occur in each of the intervals
immediately below and above 68.625-. Moreover, it is not clear whether
the modal class interval is 69½, 70½, 70¾, 71, or 72 inches, because
an equal number (15) have each of these five heights.

The ¼-inch class interval is too small in this instance to disclose
clearly the nature of variation in freshman heights.

Table 12. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943

    Heights of freshmen, X        Number of freshmen having
    Interval      Mid-value       specified heights, F

    61.875-       62.00            1
    62.125-       62.25            0
    62.375-       62.50            0
    62.625-       62.75            0
    62.875-       63.00            1
    63.125-       63.25            0
    63.375-       63.50            1
    63.625-       63.75            0
    63.875-       64.00            0
    64.125-       64.25            1
    64.375-       64.50            0
    64.625-       64.75            0
    64.875-       65.00            1
    65.125-       65.25            0
    65.375-       65.50            1
    65.625-       65.75            2
    65.875-       66.00            1
    66.125-       66.25            1
    66.375-       66.50            6
    66.625-       66.75            3
    66.875-       67.00           10
    67.125-       67.25            4
    67.375-       67.50           12
    67.625-       67.75            5
    67.875-       68.00            9
    68.125-       68.25            6
    68.375-       68.50           12
    68.625-       68.75            5
    68.875-       69.00           12
    69.125-       69.25            9
    69.375-       69.50           15
    69.625-       69.75           11
    69.875-       70.00           12
    70.125-       70.25            6
    70.375-       70.50           15
    70.625-       70.75           15
    70.875-       71.00           15
    71.125-       71.25           10
    71.375-       71.50            9
    71.625-       71.75            8
    71.875-       72.00           15
    72.125-       72.25            2
    72.375-       72.50           12
    72.625-       72.75            6
    72.875-       73.00            7
    73.125-       73.25            5
    73.375-       73.50            5
    73.625-       73.75            4
    73.875-       74.00            6
    74.125-       74.25            3
    74.375-       74.50            3
    74.625-       74.75            2
    74.875-       75.00            1
    75.125-       75.25            3
    75.375-       75.50            1
    75.625-       75.75            3
    75.875-       76.00            0
    76.125-       76.25            1
    76.375-       76.50            1
    76.625-       76.75            0
    76.875-       77.00            0
    77.125-       77.25            1

    Total                        300

A Larger Class Interval Reveals the Character of Variation. If
1 inch is taken as the class interval, the frequency distribution
will contain 17 classes and will appear as shown in Table 13.
In this frequency distribution the lower limits of the class
intervals are so chosen that mid-values are at the 0.625-inch points
(⅝ inch), which is a balancing center of the concentration points
at the ¼-inch intervals, because at ⅝ inch each mid-value has two
¼-inch concentration points below it and two above it in the
1-inch class interval. This balancing position of the ⅝-inch
points can be readily seen by an examination of Fig. 71.

Fig. 72. Distribution of heights of 300 Princeton freshmen. (Class
interval = ¼ inch.)

Fig. 73. Distribution of heights of 300 Princeton freshmen. (Class
interval = 1 inch.)
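The pooling of the quarter-inch classes of Table 12 into the 1-inch classes of Table 13 can be checked mechanically. The following is a minimal sketch in Python, which is not part of the original text; the names `QUARTER_INCH_F` and `regroup_to_one_inch` are illustrative assumptions, and the frequencies are copied from Table 12.

```python
# Frequencies F from Table 12, ordered by quarter-inch mid-value
# from 62.00 up to 77.25 (four classes per row, two in the last row).
QUARTER_INCH_F = [
    1, 0, 0, 0,      # mid-values 62.00 to 62.75
    1, 0, 1, 0,      # 63.00 to 63.75
    0, 1, 0, 0,      # 64.00 to 64.75
    1, 0, 1, 2,      # 65.00 to 65.75
    1, 1, 6, 3,      # 66.00 to 66.75
    10, 4, 12, 5,    # 67.00 to 67.75
    9, 6, 12, 5,     # 68.00 to 68.75
    12, 9, 15, 11,   # 69.00 to 69.75
    12, 6, 15, 15,   # 70.00 to 70.75
    15, 10, 9, 8,    # 71.00 to 71.75
    15, 2, 12, 6,    # 72.00 to 72.75
    7, 5, 5, 4,      # 73.00 to 73.75
    6, 3, 3, 2,      # 74.00 to 74.75
    1, 3, 1, 3,      # 75.00 to 75.75
    0, 1, 1, 0,      # 76.00 to 76.75
    0, 1,            # 77.00 and 77.25
]


def regroup_to_one_inch(quarter_f, first_mid=62.00, first_lower=61.125):
    """Pool quarter-inch classes into 1-inch classes beginning at first_lower.

    Returns a list of (lower_limit, mid_value, frequency) tuples.
    """
    pooled = {}
    for k, f in enumerate(quarter_f):
        mid = first_mid + 0.25 * k
        # Number of whole inches between first_lower and this mid-value.
        j = int((mid - first_lower) // 1)
        pooled[j] = pooled.get(j, 0) + f
    return [(first_lower + j, first_lower + j + 0.5, pooled[j])
            for j in sorted(pooled)]


ONE_INCH = regroup_to_one_inch(QUARTER_INCH_F)
for lower, mid, f in ONE_INCH:
    print(f"{lower:.3f}-  {mid:.3f}  {f:3d}")
```

Each 1-inch class collects four quarter-inch mid-values, two on either side of its ⅝-inch mid-value, which is the balancing property described above.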

In order to contrast the irregularities in the frequency
distribution using too small a class interval with the regular
appearance of the same frequency distribution using a larger class
interval, Figs. 72 and 73 are presented. Figure 72 is a graph
of the frequency distribution of heights of 300 Princeton
freshmen, using a ¼-inch class interval. Figure 73 is a graph of the
frequency distribution of heights of 300 Princeton freshmen,
using a 1-inch class interval.

The argument for a class interval centered at the ⅝-inch point
has been based on the assumption that measurements have been
made to the nearest ¼ inch. In other words, a height recorded
as 64.25 might be anything between 64.125 and 64.375. If
measurements were always made to the lowest ¼ inch, then some
other mid-point would be warranted, such as the ½-inch points,
or integral values. Table 14 is one based on this assumption.
Since the exact method of measurement is not known and since
Table 14 is simplest in form, it is adopted for subsequent analysis.
A graph of the distribution has already been shown in Fig. 71.

Table 13. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943

    Heights of freshmen, X        Number of freshmen having
    Interval      Mid-value       specified heights, F

    61.125-       61.625           1
    62.125-       62.625           1
    63.125-       63.625           1
    64.125-       64.625           2
    65.125-       65.625           4
    66.125-       66.625          20
    67.125-       67.625          30
    68.125-       68.625          35
    69.125-       69.625          47
    70.125-       70.625          51
    71.125-       71.625          42
    72.125-       72.625          27
    73.125-       73.625          20
    74.125-       74.625           9
    75.125-       75.625           7
    76.125-       76.625           2
    77.125-       77.625           1

    Total                        300

Table 14. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943

    Heights of freshmen, X        Number of freshmen having
    Interval      Mid-value       specified heights, F

    62-           62.5             1
    63-           63.5             2
    64-           64.5             1
    65-           65.5             4
    66-           66.5            12
    67-           67.5            31
    68-           68.5            31
    69-           69.5            47
    70-           70.5            48
    71-           71.5            42
    72-           72.5            35
    73-           73.5            21
    74-           74.5            14
    75-           75.5             8
    76-           76.5             2
    77-           77.5             1

    Total                        300

In frequency Tables 12 to 14, the class interval has been listed
in two ways. (1) It has been described by writing on each line
the lower limit of the class interval, followed by a dash. (2) It
has been described by writing in the next column the mid-value.
Obviously, both methods of describing the class interval need
not always be employed; the conventional procedure is to use
the lower-limit description rather than the mid-value description.
The mid-value can always be calculated by adding one-half
the class interval to the lower limit of the class interval.

WORK SHEET FOR FREQUENCY-DISTRIBUTION ANALYSIS 

The frequency distribution having been constructed, the
procedure for frequency-distribution analysis will now be
described. Table 15 is a work sheet for the analysis of a
frequency distribution; in columns (1) and (2), under X and F,
is copied the frequency distribution from Table 14. Entries
in the remaining columns will be explained below. The work
sheet is so constructed that advantage may be taken of certain
economies in calculation. These economies arise from two
sources: (1) the reduction in calculation, due to the use of a short
method that involves the calculation of the moments about an
"arbitrary origin," and (2) a reduction in calculation, due to
the use of class intervals as units of deviation from the arbitrary
origin.

Saving Calculation by Obtaining Moments about an Arbitrary
Origin. In applying the short method, an arbitrary origin,
which may be called A, is selected. While zero may be taken
as an arbitrary origin (and often is in certain statistical problems),
in the analysis of frequency distributions the amount of
calculation is reduced by selecting a value for A somewhere near the
middle of the range. The moments about the arbitrary origin
are then calculated by measuring deviations from A in
class-interval units, that is, in d/i's. Sometimes d' is used to
symbolize d/i. The savings in calculation are due to the fact that
all desired mathematical statistics can then be computed by the
use of formulas from the four moments about the arbitrary origin.

Saving Calculation by Using Class-interval Units. Saving in 
the amount of calculation to obtain the various statistics results 
if the class-interval unit is used, particularly if the variable is in 
complex or fractional units or in large numbers. This economy 
is brought about by expressing the deviations in terms of class 
intervals instead of in original units, i.e., in d/i's instead of in
d's. As pointed out above, this saving is augmented by selecting
the arbitrary origin near the middle of the frequency distribution. 
If the arbitrary origin is at or near the middle class interval, the 
largest deviation in terms of class-interval units will then be no 
greater than half the number of class intervals in the frequency 
distribution. Since the deviations must be raised to the fourth 
power in order to calculate the fourth moment, substantial saving 
in calculation is secured by keeping class-interval deviations as 
small as possible by placing the arbitrary origin near the middle 
of the frequency distribution. It will be observed in Table 15 
that the frequency distribution has been copied on the work 
sheet in such a position that the arbitrary origin is near the middle 
of the frequency distribution. It can also be seen that, when 
the class interval is uniform in size, recording the class-interval 
deviations in column (3) is merely a matter of proceeding by 
count above and below the arbitrary origin, that is, -1, -2,
-3, etc., for successive smaller class-interval values, and 1, 2, 3,
etc., for successive larger class-interval values.

Entering the Frequency Distribution on the Work Sheet. The
frequency distribution of freshmen heights shown in Table 14
has 16 class intervals; and if the mid-value of the central class
interval is to be selected as the arbitrary origin, the first class
interval, 62-, will be entered in column (1) under "Interval"
on the line opposite -7 in column (3) (d/i = -7). The remaining
class intervals will be entered in succeeding lines until 77-
will be opposite 8 in column (3) (d/i = 8). The mid-value
of the central class interval is 69.5, which is opposite 0 in column
(3) (d/i = 0). The corresponding frequencies are then entered
in column (2). Full description of the data and their source
is entered in the space provided at the top of the work sheet.

Saving Calculation in Use of Work Sheet. The amount of 
calculation involved in the entries required for columns (4) to 
(9) can be reduced to a minimum by the following procedure: 

In column (4), headed $F(d/i)$, enter the class-interval deviations
multiplied by the frequencies [i.e., items in column (3)
multiplied, respectively, by items in column (2)]. The algebraic
sum of the figures in column (4), divided by N, equals the
first moment (in class-interval units) about the arbitrary
origin.

The figures in column (5), headed $F(d/i)^2$, are obtained by
multiplying the items in column (4), respectively, by the
corresponding items in column (3). The sum of figures in column
(5), divided by N, equals the second moment (in class-interval
units) about the arbitrary origin.

The figures in column (6), headed $F(d/i)^3$, are most easily
obtained by multiplying the items in column (5), respectively,
by the corresponding items in column (3). The algebraic sum
of figures in column (6), divided by N, equals the third moment
(in class-interval units) about the arbitrary origin.

The figures in column (7), headed $F(d/i)^4$, are obtained by
multiplying the items in column (6), respectively, by the
corresponding items in column (3). The sum of figures in column
(7), divided by N, equals the fourth moment (in class-interval
units) about the arbitrary origin.

The figures in column (8), headed $(d/i + 1)^4$, are obtained
by adding 1, respectively, to each figure in column (3) and raising
the result to its fourth power. All figures in this column are
readily obtained by using a table of powers of numbers.¹ The
sum of column (8) is not used.

The figures in column (9), headed $F(d/i + 1)^4$, are obtained
by multiplying the items in column (8), respectively, by
corresponding items in column (2). The sum of column (9) is used
to check the arithmetical accuracy of all calculations in the
work sheet.

When the work sheet is completed, it will show the following
values:

$$A,\quad i,\quad N,\quad \sum F\left(\frac{d}{i}\right),\quad \sum F\left(\frac{d}{i}\right)^2,\quad \sum F\left(\frac{d}{i}\right)^3,\quad \text{and}\quad \sum F\left(\frac{d}{i}\right)^4$$

In addition, by means of columns (8) and (9), the work sheet
provides a cross check on its internal calculations, since the
expansion of $\sum F(d/i + 1)^4$ gives the following terms:

$$\sum F\left(\frac{d}{i} + 1\right)^4 = \sum F\left(\frac{d}{i}\right)^4 + 4\sum F\left(\frac{d}{i}\right)^3 + 6\sum F\left(\frac{d}{i}\right)^2 + 4\sum F\left(\frac{d}{i}\right) + \sum F$$

It follows that on a correctly constructed work sheet the sum
of column (9) equals the sum of column (7) plus four times the
sum of column (6) plus six times the sum of column (5) plus four
times the sum of column (4) plus the sum of column (2). This
is called a "Charlier check" after the name of the man who first
suggested its use as a checking device.

For Table 15 the Charlier check is as follows:

    Σ[column (2)]                         =    300
    4Σ[column (4)]      = 4 × 292         =  1,168
    6Σ[column (5)]      = 6 × 2,140       = 12,840
    4Σ[column (6)]      = 4 × 5,590       = 22,360
    Σ[column (7)]                         = 45,088

    Sum = Σ[column (9)]                   = 81,756
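The column-by-column procedure and the Charlier check can be reproduced in a few lines. The following Python sketch is an illustration only and is not part of the original text; the names `work_sheet`, `ORIGIN_INDEX`, and `TOTALS` are assumptions, and the frequencies are those of Table 14.

```python
# Frequencies F of Table 14 for the classes 62-, 63-, ..., 77-.
F = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
ORIGIN_INDEX = 7  # class 69-, whose mid-value 69.5 is the arbitrary origin A


def work_sheet(freqs, origin_index):
    """Return one tuple per class holding columns (2) to (9) of the work sheet."""
    rows = []
    for k, f in enumerate(freqs):
        d = k - origin_index      # column (3): d/i, counted from the origin
        c4 = f * d                # column (4): F(d/i)
        c5 = c4 * d               # column (5): column (4) times column (3)
        c6 = c5 * d               # column (6): column (5) times column (3)
        c7 = c6 * d               # column (7): column (6) times column (3)
        c8 = (d + 1) ** 4         # column (8): (d/i + 1)^4
        c9 = f * c8               # column (9): column (8) times column (2)
        rows.append((f, d, c4, c5, c6, c7, c8, c9))
    return rows


ROWS = work_sheet(F, ORIGIN_INDEX)
NAMES = ("col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
TOTALS = dict(zip(NAMES, (sum(column) for column in zip(*ROWS))))

# Charlier check: sum(9) = sum(7) + 4 sum(6) + 6 sum(5) + 4 sum(4) + sum(2).
CHARLIER = (TOTALS["col7"] + 4 * TOTALS["col6"] + 6 * TOTALS["col5"]
            + 4 * TOTALS["col4"] + TOTALS["col2"])

print(TOTALS)
print("Charlier check holds:", CHARLIER == TOTALS["col9"])
```

Building each column from the preceding one, as the text recommends, avoids raising any deviation to a power directly except in column (8).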


¹ Cf. Mathematical Tables from Handbook of Chemistry and Physics, pp.
153-173. For use in making calculations there are a number of convenient
devices such as the slide rule and calculating machines, as well as logarithms.
There are also several useful printed tables such as Barlow's Tables of
Squares, Cubes, Square-roots, Cube-roots, and Reciprocals of Integers up to
10,000 and Karl Pearson's Tables for Statisticians and Biometricians. A set
of logarithms will be found in Appendix, Table I.

Table 15. Work Sheet for Making Calculations in the Analysis
of a Frequency Distribution

Description of Data: Heights of 300 Princeton University Freshmen,
Class of 1943

Source of Data: Princeton University's Department of Health and
Physical Education

i = 1 in.

A = 69.5 in. (Mid-point of class interval near center of distribution)

         (1)            (2)    (3)     (4)       (5)        (6)        (7)        (8)          (9*)
    Interval  Mid-       F     d/i   F(d/i)   F(d/i)^2   F(d/i)^3   F(d/i)^4   (d/i+1)^4   F(d/i+1)^4
              point

    62-                   1    -7      -7        49       -343      2,401      1,296        1,296
    63-                   2    -6     -12        72       -432      2,592        625        1,250
    64-                   1    -5      -5        25       -125        625        256          256
    65-                   4    -4     -16        64       -256      1,024         81          324
    66-                  12    -3     -36       108       -324        972         16          192
    67-                  31    -2     -62       124       -248        496          1           31
    68-                  31    -1     -31        31        -31         31          0            0
    69-       69.5       47     0       0         0          0          0          1           47
    70-                  48     1      48        48         48         48         16          768
    71-                  42     2      84       168        336        672         81        3,402
    72-                  35     3     105       315        945      2,835        256        8,960
    73-                  21     4      84       336      1,344      5,376        625       13,125
    74-                  14     5      70       350      1,750      8,750      1,296       18,144
    75-                   8     6      48       288      1,728     10,368      2,401       19,208
    76-                   2     7      14        98        686      4,802      4,096        8,192
    77-                   1     8       8        64        512      4,096      6,561        6,561

    Σ                   300           292     2,140      5,590     45,088                  81,756

* Columns (8) and (9) are for checking purposes. Σ[column (9)] = Σ[column (2)] +
4Σ[column (4)] + 6Σ[column (5)] + 4Σ[column (6)] + Σ[column (7)].

























Moments about the Arbitrary Origin. The moments about an
arbitrary origin can be quickly calculated from the sums of
columns (4) to (7), because by definition the moments about an
arbitrary origin are as follows:

$$\nu_1 = \frac{\sum Fd}{N},\qquad \nu_2 = \frac{\sum Fd^2}{N},\qquad \nu_3 = \frac{\sum Fd^3}{N},\qquad \nu_4 = \frac{\sum Fd^4}{N}$$

where $X - A = d$.

If A were zero, d would equal X; and the moments would then
reduce to the form shown in Chap. VI.

When, as in Table 15, the deviations have been taken in
class-interval units rather than in original units, the formulas for the
moments about an arbitrary origin would be written as follows:¹

$$\nu_1' = \frac{\sum F\left(\frac{d}{i}\right)}{N},\qquad \nu_2' = \frac{\sum F\left(\frac{d}{i}\right)^2}{N},\qquad \nu_3' = \frac{\sum F\left(\frac{d}{i}\right)^3}{N},\qquad \nu_4' = \frac{\sum F\left(\frac{d}{i}\right)^4}{N} \qquad (1)$$

where $X - A = i\,(d/i)$, in which i is the class interval.

¹ Cf. p. 181. The prime on $\nu$ means that the $\nu$ is in class-interval units;
i.e., $\nu_1' = \nu_1/i$, $\nu_2' = \nu_2/i^2$, etc.

Accordingly, the moments in class-interval units about an
arbitrary origin are obtained from the sums of columns (4) to (7)
of the work sheet by dividing each by N [the sum of column (2)].

In Table 15, the moments about the arbitrary origin in
class-interval units are as follows:

$$\begin{aligned}
\nu_1' &= \frac{292}{300} = 0.97333\\
\nu_2' &= \frac{2{,}140}{300} = 7.13333\\
\nu_3' &= \frac{5{,}590}{300} = 18.63333\\
\nu_4' &= \frac{45{,}088}{300} = 150.29333
\end{aligned}$$

Moments about the Arithmetic Mean. When the moments
about an arbitrary origin are obtained, the moments about the
mean are obtained from the following equations:¹

$$\begin{aligned}
\mu_1' &= \nu_1' - \nu_1' = 0\\
\mu_2' &= \nu_2' - (\nu_1')^2\\
\mu_3' &= \nu_3' - 3\nu_2'\nu_1' + 2(\nu_1')^3\\
\mu_4' &= \nu_4' - 4\nu_3'\nu_1' + 6\nu_2'(\nu_1')^2 - 3(\nu_1')^4
\end{aligned} \qquad (2)$$

The moments about the arbitrary origin having been calculated
for the frequency distribution of freshmen heights in
Table 15, the moments about the arithmetic mean in
class-interval units may now be obtained by the use of Eqs. (2), as
follows:

$$\begin{aligned}
\mu_1' &= 0.97333 - 0.97333 = 0\\
\mu_2' &= 7.13333 - (0.97333)^2 = 7.13333 - 0.94737 = 6.18596\\
\mu_3' &= 18.63333 - 3(7.13333)(0.97333) + 2(0.97333)^3\\
       &= 18.63333 - 20.82924 + 1.84420 = -0.35171\\
\mu_4' &= 150.29333 - 4(18.63333)(0.97333) + 6(7.13333)(0.97333)^2 - 3(0.97333)^4\\
       &= 150.29333 - 72.54552 + 40.54740 - 2.69253 = 115.60268
\end{aligned}$$
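The conversion from moments about the arbitrary origin to moments about the mean is purely mechanical once the four column totals are known. The Python sketch below is an illustration, not part of the original text; the function names are assumptions, and the totals are those of columns (4) to (7) of Table 15.

```python
N = 300
# Totals of columns (4) to (7) of Table 15, keyed by the order of the moment.
COLUMN_SUMS = {1: 292, 2: 2140, 3: 5590, 4: 45088}


def moments_about_origin(sums, n):
    """Moments about the arbitrary origin, in class-interval units."""
    return {order: total / n for order, total in sums.items()}


def moments_about_mean(nu):
    """Apply Eqs. (2) to convert origin moments into moments about the mean."""
    v1, v2, v3, v4 = nu[1], nu[2], nu[3], nu[4]
    return {
        1: v1 - v1,
        2: v2 - v1 ** 2,
        3: v3 - 3 * v2 * v1 + 2 * v1 ** 3,
        4: v4 - 4 * v3 * v1 + 6 * v2 * v1 ** 2 - 3 * v1 ** 4,
    }


NU = moments_about_origin(COLUMN_SUMS, N)
MU = moments_about_mean(NU)
for order in (1, 2, 3, 4):
    print(order, round(NU[order], 5), round(MU[order], 5))
```

Because the sketch carries full floating-point precision rather than five-decimal intermediate terms, its results may differ from the hand-computed figures in the last one or two decimal places.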


¹ $\mu_1' = \mu_1/i$, $\mu_2' = \mu_2/i^2$, $\mu_3' = \mu_3/i^3$, etc.

Equations (2) for finding the moments about the mean from
the moments about an arbitrary origin may be proved as follows:

Since, in Eqs. (1) for moments about an arbitrary origin,
$i\,(d/i) = X - A$, it follows that

$$\begin{aligned}
F_1\left(\frac{d_1}{i}\right) i &= F_1(X_1 - A) = F_1X_1 - F_1A\\
F_2\left(\frac{d_2}{i}\right) i &= F_2(X_2 - A) = F_2X_2 - F_2A\\
&\;\;\vdots\\
F_n\left(\frac{d_n}{i}\right) i &= F_n(X_n - A) = F_nX_n - F_nA
\end{aligned}$$

By adding,

$$\sum F\left(\frac{d}{i}\right) i = \sum FX - NA \qquad (a)$$

since $\sum F = N$.

Because A is a constant, $d_1, d_2, \ldots, d_n$ will vary in proportion
as $X_1, X_2, \ldots, X_n$ vary. Also, since A is a constant, the
sum of the A's may be written as the constant multiplied by the
total frequencies, or NA. If now Eq. (a) is divided by N,

$$\frac{\sum F\left(\frac{d}{i}\right) i}{N} = \frac{\sum FX}{N} - A \qquad (b)$$

But, by definition,

$$\frac{\sum F\left(\frac{d}{i}\right)}{N}\,(i) = \nu_1'\,(i) = \nu_1$$

and

$$\frac{\sum FX}{N} = \bar{X}, \text{ the arithmetic mean}$$

Therefore, by substitution and transposing, Eq. (b) becomes

$$\bar{X} = A + \nu_1' i \quad\text{or}\quad \bar{X} = A + \nu_1 \qquad (3)$$

Accordingly the arithmetic mean of the frequency distribution
of 300 freshmen heights shown in Table 15 is as follows:*

$$\bar{X} = 69.5 + 0.97333 \ (\text{since } i = 1) = 70.47 \text{ in.}$$

It has thus been proved that the arithmetic mean equals any
arbitrary quantity plus the first moment about that arbitrary
quantity. In other words, the arithmetic mean of a series of
magnitudes is equal to any arbitrary quantity plus the mean of the
deviations from the arbitrary quantity. From Eq. (3) and from
the fact that $d = X - A$, it follows that $A = X - d$ and that
$\bar{X} = X - d + \nu_1$. Therefore, $X - \bar{X} = d - \nu_1$, and, writing x for
the deviation $X - \bar{X}$,

$$x = d - \nu_1 \qquad (c)$$

or $x/i = d/i - \nu_1'$ if d is in class-interval units.

This value for x may be substituted in the equations defining the
moments about the mean, as follows:¹

$$\begin{aligned}
\mu_1 &= \frac{\sum Fx}{N} = \frac{\sum F(d - \nu_1)}{N}\\
\mu_2 &= \frac{\sum Fx^2}{N} = \frac{\sum F(d - \nu_1)^2}{N}\\
\mu_3 &= \frac{\sum Fx^3}{N} = \frac{\sum F(d - \nu_1)^3}{N}\\
\mu_4 &= \frac{\sum Fx^4}{N} = \frac{\sum F(d - \nu_1)^4}{N}
\end{aligned} \qquad (4)$$

* The result of calculation is 70.47333; but since the beginning data were
significant to only two places beyond the decimal, the figures beyond .47
are not significant. The manner in which the figures are written in Table
11, which was taken from the source of the data, indicates accuracy to two
decimal places. Had the numbers been rounded off to the nearest inch,
the calculated mean would have significant figures to the nearest inch.
Nevertheless, if the value of the mean is to be used for making further
mathematical calculations to obtain other statistics, it should be carried out to
several more decimal places in order to give an accurate result to two places
in the additional statistics.

¹ For definition of moments about the mean, cf. p. 181.

After expanding and collecting like terms, these equations become

$$\begin{aligned}
\mu_1 &= \frac{\sum Fd}{N} - \nu_1\\
\mu_2 &= \frac{\sum Fd^2}{N} - 2\nu_1\frac{\sum Fd}{N} + \nu_1^2\\
\mu_3 &= \frac{\sum Fd^3}{N} - 3\nu_1\frac{\sum Fd^2}{N} + 3\nu_1^2\frac{\sum Fd}{N} - \nu_1^3\\
\mu_4 &= \frac{\sum Fd^4}{N} - 4\nu_1\frac{\sum Fd^3}{N} + 6\nu_1^2\frac{\sum Fd^2}{N} - 4\nu_1^3\frac{\sum Fd}{N} + \nu_1^4
\end{aligned} \qquad (5)$$

From values given for $\nu_1$, $\nu_2$, $\nu_3$, and $\nu_4$ in Eqs. (1), Eqs. (5) may
now be written as follows:

$$\begin{aligned}
\mu_1 &= \nu_1 - \nu_1 = 0\\
\mu_2 &= \nu_2 - \nu_1^2\\
\mu_3 &= \nu_3 - 3\nu_2\nu_1 + 2\nu_1^3\\
\mu_4 &= \nu_4 - 4\nu_3\nu_1 + 6\nu_2\nu_1^2 - 3\nu_1^4
\end{aligned}$$

which it was said at the beginning of this section would be proved.
An important corollary follows from the above derivation of the
second moment (or "variance," as it is sometimes called).
Since

$$\mu_2 = \nu_2 - \nu_1^2$$

it follows that the mean square deviation about the mean of the
observations is less than the mean square deviation about any
arbitrary quantity; that is, the mean square deviation ($\sigma^2$) about
the mean is a minimum, smaller than it would be if calculated
from any other average. This is obvious from the equation;
since $\nu_1^2$ is a positive quantity, being a square, $\mu_2$ must be less
than $\nu_2$.
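Both results of this derivation, that the mean equals any arbitrary origin plus the first moment about it and that the mean square deviation is smallest about the mean, can be verified numerically on the grouped data. The sketch below is an illustration in Python and is not part of the original text; the function names and the trial origins are assumptions, and each class is represented by its mid-value from Table 14.

```python
# Frequencies and mid-values of Table 14 (classes 62-, 63-, ..., 77-).
F = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
MIDS = [62.5 + k for k in range(len(F))]
N = sum(F)


def first_moment(origin):
    """First moment about an arbitrary origin, in inches."""
    return sum(f * (x - origin) for x, f in zip(MIDS, F)) / N


def mean_square_deviation(origin):
    """Second moment about an arbitrary origin, in square inches."""
    return sum(f * (x - origin) ** 2 for x, f in zip(MIDS, F)) / N


MEAN = 69.5 + first_moment(69.5)      # Eq. (3) with A = 69.5
MU2 = mean_square_deviation(MEAN)     # second moment about the mean

# Trial origins chosen arbitrarily for illustration.
for origin in (62.5, 69.5, 72.5):
    nu1 = first_moment(origin)
    nu2 = mean_square_deviation(origin)
    print(origin, round(origin + nu1, 5), round(nu2, 5), round(nu2 - nu1 ** 2, 5))
```

Every trial origin returns the same mean and the same value of the second moment less the squared first moment, while the second moment itself is never below the value obtained about the mean.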

The Standard Deviation. The standard deviation about the
arithmetic mean may now be quickly calculated, since it is by
definition the square root of the second moment. For the
frequency distribution of heights of 300 freshmen the standard
deviation is as follows:

$$\sigma' = \sqrt{\mu_2'} = \sqrt{6.18596} = 2.487 \text{ or } 2.49 \text{ in.}$$

Since the moments were calculated in class-interval units (see
page 212), this result is also in class-interval units. The standard
deviation in original units is found by multiplying by i. In the
present problem, i = 1; hence, $\sigma = \sigma' i = 2.49$ in.

The Beta Coefficients. For the frequency distribution of heights
of 300 freshmen, the first two beta coefficients are as follows:

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = 0.00052$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = 3.02102$$

Since the betas are ratios having i raised to the same power
in both numerator and denominator, the fact that the moments
are in class-interval units instead of original units may be
disregarded.
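The standard deviation and the two beta coefficients follow directly from the moments about the mean. A minimal Python sketch, not part of the original text, is given below; the variable names are assumptions, and the moments are the rounded values computed earlier for the heights of 300 freshmen.

```python
import math

# Moments about the mean in class-interval units, as computed in the text.
MU2 = 6.18596
MU3 = -0.35171
MU4 = 115.60268
CLASS_INTERVAL = 1.0  # i, in inches

# Standard deviation: square root of the second moment, converted to inches.
SIGMA = math.sqrt(MU2) * CLASS_INTERVAL

# Beta coefficients: pure ratios, so the class-interval unit cancels.
BETA1 = MU3 ** 2 / MU2 ** 3
BETA2 = MU4 / MU2 ** 2

print(round(SIGMA, 3), round(BETA1, 5), round(BETA2, 5))
```

Multiplying every moment by the appropriate power of i would leave `BETA1` and `BETA2` unchanged, which is why the class-interval unit may be disregarded for them.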

Measures of Skewness and Kurtosis. Measures of skewness
and kurtosis are also readily determined from the moments about
the mean. In the frequency distribution of heights of 300
freshmen, the measure of kurtosis, $\beta_2$, calculated above, is 3.021,
slightly larger than 3. Hence the frequency distribution is
somewhat less flat-topped than the normal curve.¹

Skewness in heights of the 300 freshmen, measured by the
cube root of the third moment, is -0.7057 class intervals.
Since i = 1 in., this is -0.7057 inch.

¹ Figure 101, p. 295, is a graph comparing the frequency distribution
with the ideal normal curve.

CALCULATION OF OTHER STATISTICS

Averages and Measures of Variability. Difficulties in Locating
the Median and the Mode. Consideration of the median, the
mode, and the quartiles has been left to the last for the reason
that, in the analysis of frequency distributions with class
intervals, these values must be estimated. By definition, the median
is the value at the center of the distribution, the first quartile
is the value midway between the lower limit of the range and the
median, and the third quartile is the value midway between
the median and the upper limit of the range. The mode is the
value that occurs with the greatest frequency. The calculations
of these statistics are not based on the work sheet shown in
Table 15.

Because they are concealed in the class interval among a group 
of other cases in the same class interval, the quartiles, the median, 
and the mode must be obtained by estimation. Where within 
the range of the class interval is the median? Where within 
the range of the class interval with the largest frequency is the 
mode? These questions have to be answered by interpolation, 
and the value so obtained becomes an abstract quantity—as 
abstract and mathematical in character as the mean, but without 
the latter’s precision. 

The Mode. In the case of the mode, a further difficulty arises
in finding the correct answer to the question: Which class
interval should be considered to contain the mode? If
different-sized class intervals are taken in each of several frequency
distributions of the same data, the modal class interval will be
observed to shift around. The mode, by definition the simplest
of the several measures, is actually the most difficult average
to locate. Its accurate computation is more highly mathematical
than that of the arithmetic mean. If a Pearsonian curve gives a
good fit to the data, the ideal method of obtaining the mode is to
find the mode of this curve. A formula for this is given on the
next page. The disadvantage of this method is that there is no
way of telling whether a curve is a good fit or not until it is
actually fitted, and this involves a considerable amount of
calculation just for the sake of finding the mode.

But simpler measures of the mode are often used. These
are interpolated values, on the assumption that the mode lies
in the modal class interval, that is, the class interval that has
the highest frequency. It is assumed that the general shape
of the distribution affects the distribution of cases at the point
of greatest concentration in the following manner: All the
frequencies below the modal class interval are pulling the mode near
the lower limit of that class interval, and all the frequencies
above the modal class interval are pulling the mode toward the
upper limit of the interval. The mode is equal to the lower
limit of the modal class interval plus the interpolated part of
the class interval established by the relationship of the frequencies
above and below that class interval. In the frequency
distribution of freshmen heights (Table 15), the modal class interval,
that is, the class interval with the greatest concentration of cases,
is 70-. There are 129 cases pulling the mode toward the lower
limit of the class interval 70-, and 123 cases pulling the mode
toward the upper limit. Consequently,

$$Mo = 70 + \frac{123}{252} \times 1 = 70.488 \text{ or } 70.49 \text{ in.}$$
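The interpolation just described weighs the frequencies above the modal class against those below it. The Python sketch below illustrates the rule and is not part of the original text; the function name `interpolated_mode` is an assumption, and the data are those of Table 14.

```python
# Lower class limits and frequencies of Table 14.
LOWER = list(range(62, 78))
F = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]


def interpolated_mode(lowers, freqs, width=1.0):
    """Lower limit of the modal class plus the share of cases lying above it."""
    m = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    below = sum(freqs[:m])        # cases pulling toward the lower limit
    above = sum(freqs[m + 1:])    # cases pulling toward the upper limit
    return lowers[m] + above / (below + above) * width


MODE = interpolated_mode(LOWER, F)
print(round(MODE, 3))
```

With 129 cases below and 123 above the class 70-, the interpolated fraction is 123/252, as in the text.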

The so-called "mathematical mode," an approximation of the
mode of the Pearsonian curve that is invalid if the frequency
curve is very skewed, is calculated from the following equation:¹

$$Mo = \bar{X} - 3(\bar{X} - Mi)\,^{*}$$

For the frequency distribution of 300 freshmen heights, the
mathematical mode is calculated as follows:

$$Mo = 70.47333 - 3(70.47333 - 70.4375) = 70.366 \text{ or } 70.37 \text{ in.}$$

The mode of the Pearsonian curve fitted to the data is given
by the formula:

$$Mo = \bar{X} - \sigma\,sk, \qquad \text{where } sk = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$$

For the frequency distribution of 300 freshmen heights, the
mode calculated by this equation is as follows:

$$Mo = 70.50 \text{ in.}$$
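The two formulas for the mode can be evaluated side by side. The Python sketch below is an illustration and is not part of the original text; the function names are assumptions, and the inputs are the rounded statistics obtained above. One further assumption is flagged in the code: sk is given the sign of the third moment, which is the reading under which the formula returns the 70.50 in. quoted for these data.

```python
import math

# Statistics for the heights of 300 freshmen, as computed in the text.
MEAN = 70.47333
MEDIAN = 70.4375
SIGMA = 2.487
BETA1 = 0.00052
BETA2 = 3.02102
MU3 = -0.35171


def mathematical_mode(mean, median):
    """Mean less three times the distance from median to mean."""
    return mean - 3 * (mean - median)


def pearsonian_mode(mean, sigma, beta1, beta2, mu3):
    """Mode of the fitted Pearsonian curve: mean minus sigma times sk."""
    sk = math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))
    # Assumption: sk carries the sign of the third moment (negative here).
    sk = math.copysign(sk, mu3)
    return mean - sigma * sk


MATH_MODE = mathematical_mode(MEAN, MEDIAN)
PEARSON_MODE = pearsonian_mode(MEAN, SIGMA, BETA1, BETA2, MU3)
print(round(MATH_MODE, 3), round(PEARSON_MODE, 2))
```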


¹ Cf. p. 173.

* The median is 70.4375. Cf. the next section.

The Median and the Quartiles. Determination of the median
and the quartiles by interpolation is reasonably accurate if, as
it is assumed, the cases are evenly distributed within the class
interval containing the median and the two quartiles,
respectively. The calculation of the median and the quartiles is
facilitated by making a column of cumulated frequencies as
shown in Table 16. The median is equal to the lower limit of
the class containing the N/2th case plus an interpolated amount
within the class interval determined by the ratio of the
frequencies in the interval to the balance of frequencies necessary
to make up N/2 frequencies. In the frequency distribution of
freshmen heights (Table 16), N/2 = 150. The frequencies
are counted cumulatively from the lower limit of the first class
interval (top of the table). By this count, there are 129 cases
to the lower limit of the class interval 70-. When the point
70 is reached on the quantity scale, 129 cases have been counted;
but the median is the value of the 150th case, that is, 21 cases
beyond 70. From 70 to 71 there are 48 cases. Consequently,
the ratio of interpolation within the class interval is 21/48.
Accordingly, the estimate of the median in freshmen heights is as
follows:

$$Mi = 70 + \frac{21}{48} \times 1 = 70.4375 \text{ or } 70.44 \text{ in.}$$
Table 16. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943
(In inches. Class interval 1 in.)

    X        F      Cumulative F

    62-       1          1
    63-       2          3
    64-       1          4
    65-       4          8
    66-      12         20
    67-      31         51
    68-      31         82
    69-      47        129
    70-      48        177
    71-      42        219
    72-      35        254
    73-      21        275
    74-      14        289
    75-       8        297
    76-       2        299
    77-       1        300

    Total   300

1. There are 129 cases to X = 70.

2. Since N/2 = 150.0, this leaves 150.0 - 129, or 21.0 cases to go, of the
48 cases in the next class interval (70-71).

3. The interpolated amount of the class-interval range is therefore
21/48 × 1.

The third and first quartiles are calculated by interpolating
in a similar manner for the values of the 3N/4th and the N/4th
cases. In the frequency distribution of the heights of 300
freshmen, following are the values of the quartiles:

$$Q_1 = 68 + \frac{24}{31} \times 1 = 68.774 \text{ or } 68.77 \text{ in.}$$

$$Q_3 = 72 + \frac{6}{35} \times 1 = 72.171 \text{ or } 72.17 \text{ in.}$$
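The same cumulative count serves for the median and for both quartiles; only the target case number changes. The Python sketch below illustrates the interpolation and is not part of the original text; the function name `interpolated_value` is an assumption, and the data are those of Table 16.

```python
# Lower class limits and frequencies of Table 16.
LOWER = list(range(62, 78))
F = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
N = sum(F)


def interpolated_value(target, lowers, freqs, width=1.0):
    """Value of the target-th case, assuming cases spread evenly within a class."""
    cumulative = 0
    for lower, f in zip(lowers, freqs):
        if cumulative + f >= target:
            # Cases still needed, as a fraction of the cases in this class.
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("target exceeds the total number of cases")


MEDIAN = interpolated_value(N / 2, LOWER, F)
Q1 = interpolated_value(N / 4, LOWER, F)
Q3 = interpolated_value(3 * N / 4, LOWER, F)
print(MEDIAN, round(Q1, 3), round(Q3, 3))
```

For the first quartile the count reaches 51 cases at 68, leaving 24 of the 31 cases in the class 68-; for the third it reaches 219 at 72, leaving 6 of the 35 cases in the class 72-.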


The Average Deviation. The average, or mean, deviation
is a measure of dispersion that has its minimum value when
deviations are measured from the median. To compute the
average deviation from the median, subtract each of the N values
of X from the median, add the absolute values of the deviations,
and divide by N. Thus,

$$\text{A.D.} = \frac{\sum |X - Mi|}{N} \qquad (6)$$

The average deviation is simpler in concept than any other 
measure of dispersion. It is less affected by extreme deviations 
than the more popular standard deviation, and for this reason 
it probably has greater sampling reliability from extremely 
leptokurtic universes. In spite of these advantages the average 
deviation is not a popular measure of dispersion, partly because 
of several widely accepted but mistaken notions concerning its 
properties. 

It is often said that it is illogical to neglect the signs of devia¬ 
tions to be averaged and that this fallacy is avoided in the case 
of other measures of dispersion. It is true that the mean devia¬ 
tion from the median is the mean of absolute deviations from 
some average, but every other measure of dispersion is also equal 
(or proportional) to an average of absolute deviations from some 
average. The quartile deviation is the median of absolute 
deviations from the mid-quartile, and the standard deviation 
is the quadratic mean of absolute deviations from the mean.

It has been said that the sampling reliability of the average 
deviation is less than that of the standard deviation. This may 
be true for normal universes, but it can hardly be true for all 
types. 

Grouped Data: Mid-value Assumption in Calculating Average
Deviation. When data are grouped in the form of a frequency
distribution with equal class intervals, the average deviation can
be written in the simple form

$$\mathrm{A.D.} = \frac{i \sum |F(d/i)|}{N} \tag{7}$$


where d is the deviation of class mid-values from the mid-value 
of the class interval containing the median. Although Eq. (7) 
is the exact value of the average deviation from the median 
according to the assumption that all observations in every class 
interval are equal to the mid-value of the interval (the same 
assumption commonly used for the standard deviation), many 
statisticians consider it unsatisfactory as a formula for the 
average deviation. The chief reason for the dissatisfaction seems 
to be that the mid-value assumption, which implies that the 
median is the mid-value of the median interval, is inconsistent 
with the ordinary notion of the interpolated median. 

In applying the simple formula in practice, several corrections 
may be used, some of which will be illustrated below. Each of 
these corrections deals with a separate aspect of the problem of 
approximating the average deviation of ungrouped data from a 
frequency distribution. The two most important corrections 
are usually of the same order of magnitude, but opposite in sign, 
so that they tend to offset each other. For this reason, it is 
usually advisable to use the simpler formula without correction, 
because of its simplicity, unless the problem is of great importance 
so that minor adjustments are worth making. 

Grouped Data: Histogram Assumption in Calculating Average
Deviation. The average deviation of the histogram considered 
as a continuous frequency function is often used in preference 
to the simple formula for the average deviation presented in Eq. 
(7). This corresponds to the assumption on which the usual 
interpolated median is based. The median is the abscissa of 
the vertical line that divides the histogram into two equal 
areas. When the left half of the histogram is folded along this 
vertical line, over the half on the right, the average deviation 
is the first moment about the line of folding. 

To simplify the derivation, let $d/i$ represent deviations from
the mid-value of the median interval, and let

$$\mathrm{Mi} = L + ci \tag{8}$$

where L is the lower limit of the median interval, i is the width
of the class interval, and c is the proportion of observations in
the median interval that are less than Mi. It is to be noted
that the cases are assumed to be distributed uniformly through
the interval.

In these terms the formula for average deviation can be written
as follows:

$$\mathrm{A.D.} = \frac{1}{N}\left[\, i \sum |F(d/i)| + C_1 + C_2 \right] = \frac{i}{N}\left[\, \sum |F(d/i)| + c(1 - c)F_0 \right] \tag{9}$$

in which $F_0$ is the frequency of the median interval, $C_1$ is the
amount of correction associated with observations above and
below the median interval, and $C_2$ is the amount of correction
associated with the median interval itself.

Fig. 74. Illustration of distribution of cases in and above and below the median
interval. [Diagram of the median interval, marking the lower limit, Mi, the
mid-value of the interval, and the upper limit, with $cF_0$ cases over the
distance $ci$ below Mi and $(1 - c)F_0$ cases over the distance $(1 - c)i$ above it.]

To demonstrate the truth of this equation, consider the
diagram of the median interval shown in Fig. 74. Since deviations
from the mid-value of the median interval are $(\tfrac{1}{2} - c)i$
too small for observations above the median interval and $(\tfrac{1}{2} - c)i$
too large for those below the median interval, it follows that

$$C_1 = i\left(\tfrac{1}{2} - c\right)\left[\left(\tfrac{N}{2} - (1 - c)F_0\right) - \left(\tfrac{N}{2} - cF_0\right)\right] = i\left(\tfrac{1}{2} - c\right)(2c - 1)F_0 = iF_0\left(2c - 2c^2 - \tfrac{1}{2}\right) \tag{10}$$

The area in the median interval below Mi is $cF_0$, and its mean
deviation from Mi is $ci/2$. Similarly, $(1 - c)F_0$ lies above Mi
with a mean deviation of $(1 - c)i/2$. Hence,

$$C_2 = cF_0 \cdot \frac{ci}{2} + (1 - c)F_0 \cdot \frac{(1 - c)i}{2} = iF_0\left(c^2 - c + \tfrac{1}{2}\right) \tag{11}$$

From Eqs. (10) and (11), the combined corrections are found to be

$$C_1 + C_2 = iF_0\left(2c - 2c^2 - \tfrac{1}{2}\right) + iF_0\left(c^2 - c + \tfrac{1}{2}\right) = iF_0(c - c^2) = iF_0\,c(1 - c) \tag{12}$$

a result that verifies Eq. (9). Equation (9) is probably the most
convenient form available for computing the mean deviation
according to the histogram assumption.

Calculation of the average deviation by the use of Eq. (9) 
is illustrated by Table 17 and the ensuing analysis. 


Table 17. Frequency Distribution of the Heights of 300 Princeton
Freshmen, Class of 1943
(In inches. Class interval 1 in.)

  X        F     d/i    F(d/i)
  62-      1     -8       -8
  63-      2     -7      -14
  64-      1     -6       -6
  65-      4     -5      -20
  66-     12     -4      -48
  67-     31     -3      -93
  68-     31     -2      -62
  69-     47     -1      -47
  70-     48      0        0
  71-     42      1       42
  72-     35      2       70
  73-     21      3       63
  74-     14      4       56
  75-      8      5       40
  76-      2      6       12
  77-      1      7        7
  Total  300            -298
                        +290
  Σ (without regard to sign) = 588


When the median and the quartiles were calculated, it was
assumed that the frequencies were evenly distributed in the class
intervals. This assumption is continued while calculating the
average deviation about the median. As shown in Table 17,
the sum of the deviations about the arbitrary origin without
regard to sign is 588. That is,

$$\sum |F(d/i)| = 588$$

where

$$d_1 = X_1 - A, \quad d_2 = X_2 - A, \quad \ldots, \quad d_n = X_n - A$$

The sum desired is the sum without regard to sign of deviations
from the median. That is,

$$\sum |F(x'/i)|$$

where

$$x'_1 = X_1 - \mathrm{Mi}, \quad x'_2 = X_2 - \mathrm{Mi}, \quad \ldots, \quad x'_n = X_n - \mathrm{Mi}$$

Note: $x$ has been used to symbolize the deviations from the arithmetic
mean; $x'$ is used to symbolize deviations from the median.

Accordingly, the above sum, 588, which for the illustration
chosen is $\sum |F(d/i)|$, can be adjusted by a calculated correction
that will change the sum to $\sum |F(x'/i)|$. This correction is
obtained by using Eq. (9).

From Table 17 and the analysis on pages 221 to 223 it is to be
noted that $F_0 = 48$, the frequency of the interval containing
the median; $i = 1$; and $c = 0.44$, since the median is 70.44, the
lower limit of the interval containing the median is 70, and c is
the proportion of observations in the median interval that are
less than the median. Accordingly, the average deviation may
be calculated by using Eq. (9), as follows:

$$\mathrm{A.D.}_{\mathrm{Mi}} = \frac{i}{N}\left[\sum |F(d/i)| + c(1 - c)F_0\right] = \frac{1[588 + 0.44(0.56)48]}{300} = \frac{588 + 11.83}{300} = 2.00 \text{ in.}$$
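
The calculation above can be retraced, and Eq. (9) checked against a direct class-by-class computation, with the following sketch. It is an illustration only: the helper `uniform_abs_moment` and the variable names are assumptions introduced here, and the code keeps the unrounded ratio c = 21/48 = 0.4375 where the text rounds to 0.44, so the correction comes to 11.8125 rather than 11.83; both give 2.00 in. to two places.

```python
# Table 17: lower class limits (inches) and frequencies; class width i = 1.
lower_limits = list(range(62, 78))
freqs = [1, 2, 1, 4, 12, 31, 31, 47, 48, 42, 35, 21, 14, 8, 2, 1]
i = 1
N = sum(freqs)                                   # 300

# Locate the median interval and the proportion c of its cases below Mi.
cumulative = 0
for k, f in enumerate(freqs):
    if cumulative + f >= N / 2:
        median_index, F0 = k, f
        c = (N / 2 - cumulative) / f             # 21/48 = 0.4375
        break
    cumulative += f
Mi = lower_limits[median_index] + c * i          # 70.4375

# Eq. (7): step deviations d/i from the mid-value of the median interval.
sum_abs_Fd = sum(f * abs(k - median_index) for k, f in enumerate(freqs))
ad_mid_value = i * sum_abs_Fd / N                # 588 / 300 = 1.96

# Eq. (9): add the combined correction c(1 - c)F0.
ad_histogram = i * (sum_abs_Fd + c * (1 - c) * F0) / N

def uniform_abs_moment(a, b, m):
    """Mean of |x - m| for x spread evenly over the interval [a, b]."""
    if m <= a or m >= b:
        return abs((a + b) / 2 - m)
    return ((m - a) ** 2 + (b - m) ** 2) / (2 * (b - a))

# Direct check: average |x - Mi| over the whole histogram, class by class.
ad_direct = sum(
    f * uniform_abs_moment(lo, lo + i, Mi) for lo, f in zip(lower_limits, freqs)
) / N
```

The agreement of `ad_histogram` with `ad_direct` is a numerical confirmation, for this one table, of the algebra leading to Eq. (12).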


The Semiquartile Range. The semiquartile range, one-half
of the difference between the third quartile and the first quartile,
is another statistic that measures variability. Its formula is

$$Q = \frac{Q_3 - Q_1}{2}$$

For the frequency distribution in Table 15, the semiquartile
range is calculated as follows:

$$Q = \frac{72.17143 - 68.77419}{2} = 1.69862 \text{ or } 1.70 \text{ in.}$$


Measures of Skewness. From statistics measuring variation
and central tendencies, important measures of skewness are
obtained. It has been noted that $\bar{X} - \mathrm{Mo}$ is a measure of
skewness. In the frequency distribution of 300 Princeton
freshmen heights,

$$\bar{X} - \mathrm{Mo} = 70.473333 - 70.36584 = 0.10749 \text{ or } 0.11 \text{ in.}$$

The position of the first and third quartiles in relation to the
median is a very convenient statistic measuring skewness,
namely,

$$(Q_3 - \mathrm{Mi}) - (\mathrm{Mi} - Q_1) \quad \text{or} \quad Q_1 + Q_3 - 2\mathrm{Mi}$$

For the frequency distribution of heights of 300 Princeton
freshmen this statistic is

$$68.7742 + 72.1714 - 2(70.4375) = 0.07 \text{ in.}$$
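
The semiquartile range and the quartile measure of skewness can be recomputed from the interpolated values, as in the sketch below. The fractions are the interpolation ratios read from the frequency table of heights; the variable names are assumptions made for the illustration.

```python
from fractions import Fraction

# Interpolated values for the 300 freshmen heights (inches), kept exact.
Q1 = 68 + Fraction(24, 31)
Mi = 70 + Fraction(21, 48)
Q3 = 72 + Fraction(6, 35)

semiquartile_range = (Q3 - Q1) / 2
quartile_skewness = (Q3 - Mi) - (Mi - Q1)      # identical to Q1 + Q3 - 2*Mi
```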


COEFFICIENTS OF VARIABILITY 

The various aggregative measures of variability may con¬ 
veniently be expressed as relatives or coefficients, as explained 
in the preceding chapter; indeed, they must be so expressed if 
comparisons are to be made with other frequency distributions 
having different types of units. The aggregative measures of 
variability are converted into relatives or coefficients by dividing 
the former by the mean, the median, or the average of the two 
quartiles. For the present problem, the relative measures of
variability that would be useful in comparing this frequency
distribution with other frequency distributions are as follows:

$$V_\sigma = \frac{\sigma}{\bar{X}} = 3.53 \text{ per cent}$$

Fa d. = = 1-38 per cent 

-T V3 

The formula for the Vq is really the semiquartile range divided 
by the average of the two quartiles, but the 2's cancel out, leaving 
merely the difference between the two quartiles in the numerator 
and their sum in the denominator. 


COEFFICIENTS OF SKEWNESS 

Statistics measuring skewness are likewise more significant 
for comparative purposes when expressed as coefficients. The 
various coefficients of skewness for the frequency distribution 
in Table 16 are as follows: 

Based on mathematical statistics: 



$$\frac{-0.32764 \text{ in.}}{2.48716 \text{ in.}} = -0.1317 \text{ or } -13.17 \text{ per cent}$$


9) ^ -<^-0112, or -1.12 per cent 

Note: This is given the negative sign because the third moment is
negative.


Based on other statistics (using Mo = 70.488):

$$sk = \frac{\bar{X} - \mathrm{Mo}}{\sigma} = \frac{-0.015 \text{ in.}}{2.487 \text{ in.}} = -0.006 \text{ or } -0.6 \text{ per cent}$$

(If the so-called "mathematical mode," i.e., Mo = 70.366, is used, this
coefficient of skewness by the same formula would be +4.32 per cent.)

Using the median and the two quartiles to measure skewness,
the following result is obtained:

$$sk_Q = \frac{Q_3 + Q_1 - 2\mathrm{Mi}}{Q} = \frac{+0.0706 \text{ in.}}{1.69862 \text{ in.}} = +0.0416 \text{ or } +4.16 \text{ per cent}$$
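
The coefficients just quoted can be retraced numerically. In the sketch below the mean, standard deviation, and the two values of the mode are simply copied from the text, and the form (mean minus mode) divided by the standard deviation is the one implied by the figures shown; the variable names are assumptions for the illustration.

```python
mean = 70.473333          # arithmetic mean, inches (from the text)
sigma = 2.48716           # standard deviation, inches (from the text)
mode_used = 70.488        # mode used in the text
mode_mathematical = 70.366

Q1 = 68 + 24 / 31
Mi = 70 + 21 / 48
Q3 = 72 + 6 / 35
Q = (Q3 - Q1) / 2         # semiquartile range

sk_mode = (mean - mode_used) / sigma                  # about -0.006
sk_mode_mathematical = (mean - mode_mathematical) / sigma   # about +0.0432
sk_quartile = (Q3 + Q1 - 2 * Mi) / Q                  # about +0.0416
```

The change of sign between the two mode-based coefficients shows how sensitive this measure is to the way the mode is located.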


The difficulty of locating the mode, even when quite a large
sample is taken, is illustrated by the frequency distribution
analyzed in this chapter. In this illustration every mathematical
indication is that the mode is larger than the mean, but the
nonmathematically calculated mode (the interpolated mode) is
smaller than the mean.

GRAPHIC INTERPRETATION OF STATISTICS OF VARIABILITY AND SKEWNESS

Figure 75 shows on a scale the relative location of the median,
the two quartiles, and the upper and lower limits found by taking
Mi ± A.D.Mi, namely, 70.44 ± 2.00. The figure illustrates the
fact that, when there is skewness, the location of the quartiles
with reference to the median reflects the presence of skewness.
If, therefore, the quartiles are used as measures of deviation,
they reflect the fact that, in skewed distributions, the deviation
is skewed in one direction or the other. If the average deviation
is used as a measure of deviation or variability, the presence of
skewness will not be noted in the results. Whenever the
distribution is skewed to any extent, the quartiles are unequal
distances from the median, as may be noticed in Fig. 75. As the
figure also illustrates, the average deviation is conceived as an
equal distance on each side of the median.

Fig. 75. Illustration of significance of average deviation and two quartiles as
measures of dispersion. [Scale marked at 68.44, 68.77, 70.44, 72.17, and 72.44.]

Figure 76 shows on a scale the relative location of the median
and average deviation and the location of the mean and the
standard deviation by depicting the upper and lower limits of
Mi ± A.D.Mi (as in Fig. 75) and, in addition, the upper and
lower limits of $\bar{X} \pm \sigma$. As in the case of the average deviation,
so also in the case of the standard deviation, the measure of
variability is conceived as an equal distance above and below
the mean, that is, an equal distance from the mean on the
x-axis in both the positive and the negative directions in Figs.
75 and 76. If the distribution is skewed to a marked extent, it
should be evident that care must be exercised in interpreting the
significance of the average deviation or the standard deviation.

From Fig. 75 it is noted that the first quartile in the negative
direction and the third quartile in the positive direction are less
distant from the median than ±A.D.Mi. By definition, the
limits of the range between the first and third quartiles include
exactly 50 per cent of the cases. For a normal distribution,¹ the
distance between the upper and lower limits defined by
Mi ± A.D.Mi includes approximately 58 per cent of the cases.²

Fig. 76. Illustration of the standard deviation and average deviation as measures
of variability. [Scale marked at 67.98, 70.47, and 72.96.]

It will be noted from Fig. 76 that the limits $\bar{X} \pm \sigma$ are farther,
respectively, in the positive and negative directions from the
mean than are the limits Mi ± A.D.Mi from the median. The
standard deviation is always larger than the average deviation;
in fact, an approximate check³ on the accuracy of calculation may
be used as follows: A.D. = 0.8σ. In the frequency distribution
illustrated, this check works fairly well; for 0.8(2.49) = 1.99 and
the calculated A.D.Mi = 2.00. For a normal distribution the
distance between the upper and lower limits defined by $\bar{X} \pm \sigma$
includes approximately two-thirds (68 per cent) of the cases.²
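
The percentages quoted for a normal distribution, and the check A.D. = 0.8σ, can be reproduced from the error function in Python's standard library. This is a sketch that assumes an exactly normal distribution; `prob_within` is a hypothetical helper name, and 0.6745 is the usual normal-curve multiple of σ for the quartiles, a figure not taken from this passage.

```python
import math

def prob_within(k):
    """P(|Z| < k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))

# For a normal distribution the average deviation equals sigma * sqrt(2/pi).
ad_over_sigma = math.sqrt(2 / math.pi)           # about 0.798, hence A.D. ~ 0.8 sigma

share_within_ad = prob_within(ad_over_sigma)     # about 0.575
share_within_sigma = prob_within(1.0)            # about 0.683
share_within_quartiles = prob_within(0.6745)     # about 0.500
```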

¹ See Chap. XI for description of a normal distribution.

² For more precise discussion and explanation, see Chaps. XI and XII.

³ For distributions that depart widely from the normal form, this check
may not be satisfactory.

FREQUENCY DISTRIBUTIONS WITH UNEQUAL CLASS INTERVALS

As remarked earlier in this chapter, the size of the class interval
should ordinarily be uniform throughout a given frequency
distribution; but in some cases, usually because there is a large
concentration of cases at one or the other extreme of the range,
it is considered necessary to use different-sized class intervals for
parts of the frequency distribution in order to give a proper
picture of variability. Table 18 illustrates such an instance.
Of 150 cases distributed over the range 0-51, 106 cases fell within 
the limits 0-10. Obviously, a small number of class intervals 
of uniform size would give a wholly erroneous notion of the 
variation. Occasionally, data at its primary source will be 
published in a manner similar to that of Table 18, and the 
statistician has no choice but to utilize the material in frequency 
distributions that have unequal-sized class intervals. This is 
particularly true of statistics of wages and income and statistics 
of hours of labor. 


Table 18. Deaths Due to Automobile Accidents in 150 Cities,*
First 20 Weeks of 1940

X = number of deaths due to automobile accidents
F = number of cities whose automobile accident fatalities were as specified
d/i = deviation from an arbitrary origin (A = 15), in scale units†

  Intervals    Mid-values     F      d/i
  0-              0.5        11     -1.45
  1-              2.0        23     -1.30
  3-              4.0        34     -1.10
  5-              7.5        38     -0.75
  10-            15.0        24      0
  20-            25.0        12      1.00
  30-            35.0         4      2.00
  40-51          45.5         4      3.05
  Total                     150

* New York, Los Angeles, Chicago, and Detroit are excluded from these statistics.
United States Bureau of the Census, Weekly Accident Bulletin, May 24, 1940,
pp. 1-4.
† i = 10.

If the mid-value of the class interval 10- is taken as the
arbitrary origin, that is, A = 15, and the "class interval" or
abscissa scale unit i is taken equal to 10 (since that size interval
predominates), the deviations of class intervals in that part of the
frequency distribution where class intervals are equal are readily
determined. Where the class intervals are unequal, simple subtraction
of mid-values and the division of the answer by the
scale unit gives the results in the last column of Table 18. To
illustrate the process, there is a difference of 10.5 between mid-value
35 and mid-value 45.5, a quantity 1.05 times the scale
unit. Accordingly, the deviation advances from 2.0 to 3.05
scale units. In the lower reaches of the range there is a difference
of 7.5 between mid-value 15 and mid-value 7.5, or 3/4 of a scale unit;
consequently, the step-deviation change is from 0 to -0.75.
From mid-value 7.5 to mid-value 4, the deviation recedes 0.35
of a scale unit to -1.10. From mid-value 4.0 to mid-value 2.0, a
distance of 1/5 of a scale unit, the scale-unit deviation changes from
-1.10 to -1.30. Finally, from mid-value 2.0 to mid-value 0.5,
a distance of 0.15 of a scale unit, the scale-unit deviation recedes
from -1.30 to -1.45.
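
The step deviations in the last column of Table 18 follow from one subtraction and one division per class, as the sketch below shows. The closing computation of the mean is added only as an illustration of how the deviations are then used exactly as with uniform classes; it is not a figure given in the text, and the variable names are assumptions.

```python
# Table 18: class mid-values and the number of cities in each class.
mid_values = [0.5, 2.0, 4.0, 7.5, 15.0, 25.0, 35.0, 45.5]
cities = [11, 23, 34, 38, 24, 12, 4, 4]

A = 15.0     # arbitrary origin: mid-value of the class 10-
i = 10.0     # scale unit: the predominant class width

# Deviation of each mid-value from the origin, in scale units.
step_deviations = [round((m - A) / i, 2) for m in mid_values]

# The mean obtained through the step deviations ...
N = sum(cities)
mean_from_steps = A + i * sum(f * d for f, d in zip(cities, step_deviations)) / N
# ... agrees with the mean taken directly from the mid-values.
mean_direct = sum(f * m for f, m in zip(cities, mid_values)) / N
```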

From this point on, the analysis of the frequency distribution 
is the same as it would be were uniform class intervals used, 
although obviously the uneven numbers add somewhat to the 
burden of filling in the work sheet according to the plan shown in 
Table 16. Once the work sheet has been completed, however, 
the fact that the class intervals are not uniform ceases to be a
consideration in the subsequent computations; the summation
figures can be applied in the formulas in precisely the same 
manner as if the class intervals were uniform. 

ACCURACY IN THE CALCULATION OF STATISTICS 

Ordinary common sense would dictate that all recording of 
figures needs to be carefully checked, since there is always a 
chance of making a mistake in copying. Such mistakes are not 
statistical errors to be disregarded under the "theory of errors,"
which is explained in Chap. XI. They cannot be disregarded, 
and every effort should be made to prevent their occurrence. 
The same applies to all calculations made, but frequently 
short-cut or cross checks can be devised for these. While 
accuracy is essential, a spurious accuracy may be introduced into 
final answers. For example, in most cases final figures repre¬ 
senting samples should be presented in round numbers, including 
only the significant figures in the arithmetical answers obtained.¹

¹ The meaning of "significant" is explained on p. 213, note.

Care must be taken, however, in cases where errors are likely to
accumulate through successive steps of calculation. It may be
necessary to retain the figures in a calculated result for a number
of places beyond the significant figures if that calculated result
is being used in the process of calculating other statistics. In
some statistical problems it is necessary to add a constant
successively perhaps fifty or even hundreds of times, or, similarly,
to multiply by a constant successively a large number of times. 
In such instances the constant should be written to several more 
places than will be used in the final answers in order to avoid 
an error in significant figures at the end of the process. This is a 
purely mathematical problem; in every case, the standard of 
accuracy required, or the number of significant figures, having 
been decided upon, a simple arithmetical calculation will show to 
how many places the intermediary calculations must be carried. 
The final results are then rounded off to the number of significant 
figures. 
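
The effect of carrying too few places in a constant that is added many times can be seen in a short experiment. The constant 1/7 and the 200 repetitions are arbitrary choices made for this sketch, and `accumulated` is a hypothetical helper name.

```python
from decimal import Decimal

true_constant = Decimal(1) / Decimal(7)      # 0.142857...
steps = 200                                  # number of successive additions

def accumulated(places):
    """Total after adding the constant `steps` times, the constant first
    being rounded to `places` decimal places."""
    rounded = round(true_constant, places)
    return rounded * steps

exact_total = true_constant * steps          # about 28.5714
coarse_total = accumulated(2)                # 0.14 * 200 = 28.00
fine_total = accumulated(5)                  # 0.14286 * 200 = 28.57200
```

With two places carried the total is wrong in the first decimal; with five places it is correct to two decimals, which is the sense in which the constant "should be written to several more places than will be used in the final answers."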

In rounding numbers the rule is that a remainder less than 
half a unit is disregarded, while half or more than half is counted 
as an additional unit. Exactly half may be changed to the 
nearest even number—thus 174.5 would be 174 but 175.5 would 
be 176. 
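
The rounding rule, including the treatment of an exact half, corresponds to what Python's `decimal` module calls `ROUND_HALF_EVEN`. The sketch below is a minimal illustration; `round_to_units` is an assumed helper name, and values are passed as strings so that they are read as exact decimals.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_to_units(value):
    """Round to a whole number; an exact half goes to the nearest even number."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

examples = {v: round_to_units(v) for v in ["174.4", "174.5", "174.6", "175.5"]}
```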



PART III 


The Normal Frequency Curve 

CHAPTER VIII 
PROBABILITY 

Up to this point, the discussion has primarily been concerned 
with “descriptive statistics.” Attention has centered upon 
methods of summarizing and describing statistical variation. 
Occasionally, theory has been employed to explain certain 
methods or to indicate why one method is to be preferred to 
another; but, in general, emphasis has been upon the facts as 
such, rather than upon any theoretical explanation of or inference 
to be made from these facts. 

In contrast, the next four chapters will be primarily concerned 
with a particular body of theoretical statistics, namely, the theory
of the normal frequency curve. The question now to be con¬ 
sidered is not “what” is the character of a given frequency dis¬ 
tribution, but “why.” The discussion will be abstract and 
general and will not pertain to actual concrete data, except by 
way of illustration. 

Before this theoretical analysis can be undertaken, however, 
certain mathematical tools must be acquired and certain funda¬ 
mental concepts clarified. That is the purpose of this and the 
next chapters. 

PERMUTATIONS AND COMBINATIONS 

Permutations Defined and Illustrated. A "permutation" is
an arrangement. The word "man," for example, is a special
arrangement of the three letters m, a, and n. Other possible
arrangements of these three letters are: mna, nma, nam, anm,
and amn. All these arrangements are permutations.

In general, if there are N different things, it is possible to form
N! different permutations.¹ Consider again the three letters
m, a, and n. In making various arrangements of these, it is
possible to pick the first letter in three different ways. The
first letter having been picked, there are then left two different
ways for the selection of the second letter. Finally, the first
two letters having been selected, there remains one, and only one,
way for the selection of the last letter. Now, each one of the two
ways that are open for the selection of the second letter can be
combined with each of the three ways that are open for the
selection of the first letter, so that there are 3 × 2 different ways
of picking the first and second letters. Since there is only one
way left in every case for the selection of the last letter, there are
therefore 3 × 2 × 1 = 6 different ways of picking all the three
letters. Thus, the number of different permutations of three
things is 3! = 6. If there had been 10 different letters, the
number of different permutations of these would have been
10 × 9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1 = 10! = 3,628,800.

Suppose, now, that among 10 different things 3 are to be 
selected for some particular purpose, the exact nature of the 
purpose being immaterial for the analysis. The question is: 
In how many different ways may a subgroup of 3 be selected 
from the total of 10; in other words, what is the number of differ¬ 
ent permutations that can be made of 10 things taken 3 at a 
time? This question may be answered as follows: It is possible 
to select the first of the subgroup of 3 in 10 different ways, the 
second in 9 different ways, and the third in 8 different ways. 
There are thus altogether 10 X 9 X 8 different ways in which 
the 3 things may be selected from the total of 10. Accordingly, 
the number of different permutations of 10 things taken 3 at a 
time is 10 X 9 X 8 = 720. In general, the number of different 
permutations of N things taken r at a time is 

$$P_r^N = N(N - 1) \cdots \text{ to } r \text{ factors}$$

that is,

$$P_r^N = N(N - 1)(N - 2) \cdots (N - r + 1) \tag{1}$$
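
The counts in this section can be confirmed by enumeration. The sketch below lists the arrangements of the letters of "man" and evaluates Eq. (1) as a product of r factors; `permutations_count` is a hypothetical helper written for the illustration, and `math.perm` (Python 3.8 or later) is used only as a cross-check.

```python
import math
from itertools import permutations

arrangements = ["".join(p) for p in permutations("man")]
# ['man', 'mna', 'amn', 'anm', 'nma', 'nam']

def permutations_count(n, r):
    """Eq. (1): N(N - 1)(N - 2) ... (N - r + 1), a product of r factors."""
    product = 1
    for factor in range(n, n - r, -1):
        product *= factor
    return product

ten_taken_three = permutations_count(10, 3)      # 10 * 9 * 8
ten_taken_all = permutations_count(10, 10)       # 10!
```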

Combinations Defined and Illustrated. A "combination" is
not the same thing as a permutation. A group of 3 letters constitutes
a combination of these 3 letters; but as has just been
seen, this combination can be arranged in 3! different ways. In
other words, it is possible to have 3! permutations of a single
combination of 3 things. In general, it is possible to have N!
permutations of a single combination of N things.

¹ N! is to be read "N factorial" and signifies the successive product of N
by all the integers less than N and greater than zero.

Although a group of N things forms but a single combination, 
subgroups may be picked in such a way as to constitute different 
combinations. Suppose, for example, that the board of directors 
of a given corporation consists of 10 men and the chairman. 
The chairman wishes to pick a committee of 3 men. In how 
many different ways can such a committee be constituted, the 
chairman himself being excluded? This is a question of how 
many different combinations of 3 men may be taken from a 
group of 10 men. It will be noted that the order of selection 
is immaterial, for it is only the constituency of any committee 
that differentiates it from other possible committees. 

The answer to this question is obtained as follows: Let $C_3^{10}$
represent the number of combinations to be calculated, i.e., the
number of different combinations of 10 things taken 3 at a time.
Each one of these combinations, it will be recalled, can be
arranged in 3! different ways; i.e., there are 3! different ways in
which a given committee can be selected. Accordingly, the
total number of ways in which a committee of 3, i.e., just any
committee and not a particular committee, can be chosen is
equal to $C_3^{10} \times 3!$. But the total number of different ways in
which a committee of 3 can be picked from a group of 10 is the
number of permutations of 10 things taken 3 at a time, which is
equal to 10 × 9 × 8. Therefore, $C_3^{10} \times 3! = 10 \times 9 \times 8$, and

$$C_3^{10} = \frac{10 \times 9 \times 8}{3!} = 120$$

In general, the number of different combinations of N things taken r at a time is

$$C_r^N = \frac{N(N - 1)(N - 2) \cdots (N - r + 1)}{r!} \tag{2}$$

or if numerator and denominator are both multiplied by (N - r)!,

$$C_r^N = \frac{N!}{r!\,(N - r)!} \tag{3}$$
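
The committee example can be checked by listing every possible committee. In the sketch below the director labels are hypothetical, `combinations_count` is an assumed helper implementing Eq. (3), and `math.comb` and `math.perm` (Python 3.8 or later) serve as cross-checks.

```python
import math
from itertools import combinations

directors = [f"director_{k}" for k in range(1, 11)]   # hypothetical labels
committees = list(combinations(directors, 3))         # order is immaterial

def combinations_count(n, r):
    """Eq. (3): N! / (r! (N - r)!)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

number_of_committees = combinations_count(10, 3)
# Each committee can be arranged in 3! ways, which recovers the permutations.
ordered_selections = number_of_committees * math.factorial(3)
```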


The Binomial Expansion. A use that is made of combinatorial
theory in elementary algebra is to find a formula for the
expansion of the binomial $(x + y)^N$. It will be recalled that
$(x + y)^2 = (x + y)(x + y)$ is found by multiplying each term
of the first factor by each term of the second factor and adding
these partial products. Thus

$$(x + y)^2 = x^2 + xy + xy + y^2 = x^2 + 2xy + y^2$$

A higher powered binomial can be evaluated by mere repetition
of this process. Thus

$$(x + y)^3 = (x + y)(x + y)(x + y) = (x^2 + 2xy + y^2)(x + y)$$

$$= x^3 + 2x^2y + xy^2 + x^2y + 2xy^2 + y^3 = x^3 + 3x^2y + 3xy^2 + y^3$$

It will be noted that the result in each case consists of a series of
terms in diminishing powers of x (or rising powers of y), and this
is generally true no matter what the power of the binomial.
It will also be noted that the number of times a given term
occurs (i.e., its coefficient in the expansion) depends on the
number of ways the x's (or y's) that make up that term can be
selected from the different factors. Thus in the case of $(x + y)^3$
the term composed of three x's, that is, $x^3$, can be formed in only
one way, namely, by taking an x from each of the three factors.
The term $x^2y$, however, which contains two x's, can be formed
in three ways. This is because the number of different combinations
that can be made of three x's taken two at a time is
$\frac{3 \cdot 2 \cdot 1}{2 \cdot 1 \cdot 1} = 3$. Similarly, the coefficient of $xy^2$ is the number of
different combinations of three x's taken one at a time, which is
$\frac{3 \cdot 2 \cdot 1}{1 \cdot 2 \cdot 1} = 3$. Accordingly, the expansion of $(x + y)^3$ might
be written $(x + y)^3 = C_3^3 x^3 + C_2^3 x^2y + C_1^3 xy^2 + C_0^3 y^3$, where $C_3^3$
means the number of combinations of three things taken three
at a time, $C_2^3$ equals the number of combinations of three things
taken two at a time, etc.,¹ the evaluation of these quantities to be
determined by Eq. (3). If consideration were given to the
power of y instead of x, this new method of writing the expansion
of $(x + y)^3$ would become

$$(x + y)^3 = C_0^3 x^3 + C_1^3 x^2y + C_2^3 xy^2 + C_3^3 y^3$$

¹ Note that 0! is taken by convention to be 1, so that $C_3^3 = 1$.

In general,

$$(x + y)^N = C_N^N x^N + C_{N-1}^N x^{N-1}y + \cdots + C_2^N x^2y^{N-2} + C_1^N xy^{N-1} + C_0^N y^N \tag{4}$$

or, on using the second method of expression,

$$(x + y)^N = C_0^N x^N + C_1^N x^{N-1}y + \cdots + C_{N-2}^N x^2y^{N-2} + C_{N-1}^N xy^{N-1} + C_N^N y^N \tag{4a}$$

Thus

$$(x + y)^4 = C_0^4 x^4 + C_1^4 x^3y + C_2^4 x^2y^2 + C_3^4 xy^3 + C_4^4 y^4 = x^4 + 4x^3y + 6x^2y^2 + 4xy^3 + y^4$$

and

$$(x + y)^5 = C_0^5 x^5 + C_1^5 x^4y + C_2^5 x^3y^2 + C_3^5 x^2y^3 + C_4^5 xy^4 + C_5^5 y^5 = x^5 + 5x^4y + 10x^3y^2 + 10x^2y^3 + 5xy^4 + y^5$$

It is in this way that the combinatorial formulas enter into the
binomial expansion. Later it will be seen that a certain frequency
distribution is called a "binomial distribution" because
its relative frequencies are computed in the same way as the
coefficients of the terms of a binomial expansion.
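
That the coefficients of the expansion are the combinatorial numbers can be checked by carrying out the repeated multiplication mechanically. The sketch below is an illustration under assumed helper names: one function reads the coefficients from the combinatorial formula, the other multiplies by (x + y) one factor at a time, keeping only the coefficients of the terms in falling powers of x.

```python
import math

def binomial_coefficients(n):
    """Coefficients of (x + y)**n in falling powers of x, from Eq. (3)."""
    return [math.comb(n, r) for r in range(n + 1)]

def expand_by_multiplication(n):
    """Multiply out (x + y) n times, tracking the coefficient of x**(n-k) * y**k."""
    coeffs = [1]                                  # (x + y)**0
    for _ in range(n):
        # Multiplying by x keeps each term's power of y; multiplying by y raises it.
        times_x = coeffs + [0]
        times_y = [0] + coeffs
        coeffs = [a + b for a, b in zip(times_x, times_y)]
    return coeffs

fourth_power = binomial_coefficients(4)
fifth_power = binomial_coefficients(5)
```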

MATHEMATICAL PROBABILITY 

The concept of probability has been the subject of much 
debate among philosophers, mathematicians, and statisticians. 
To enter into this debate, however, would be beyond the scope 
of this book.¹ Although the concept of probability presented
below appears to be the most suitable for an elementary text
and is apparently the one most in favor among statisticians, it
must not be thought that other approaches are necessarily
invalid or even possibly less fruitful.²

^ A brief review of the classical theory and the frequency theory of R. von 
Mises is presented in the Appendix, pp. 242-251. 

2 The concept of probability presented in this book is patterned after that 
presented by J. Neyman in his Lectures and Conferences on Mathematical 
Statistics (Graduate School of the United States Department of Agriculture, 
Washington, 1937). His views, Dr. Neyman believes, ^^are shared by E. S. 
Pearson and other workers attached to the Department of Statistics at 
University College, London.” He also refers to H. Cramer, Random 
Variables and Probability Distrihvtions (Cambridge, 1937); Maurice Frechet, 
Recherches theoriques modemes sur la theorie des prohahilitSs (Gauthiers- 
Villars, Paris, 1937); A. Kolmogoroff, Grundhegriffe der Wahrscheinlich- 
keitsrechnung (Julius Springer, Berlin, 1933); and D. J. Struik, “On the 
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Definition. A discussion of probability can best begin with a 
finite set of objects. Suppose that, in a given set of t objects, m 
possess a given property and n do not possess this property. 
Then tha probability of an object of this set having the given 
property is or the relative frequency of these objects in the 
set. The word '^object” as used in this definition is to be 
interpreted broadly. Besides objects proper, it may be taken 
to include events that haye the property of occurring or even 
propositions that have the property of being true. 

To illustrate the above definition of probability, consider an 
ordinary deck of 52 playing cards. This will have 26 red cards 
and 26 black cards; hence the probability of a red card in this 
deck is ff = i- The deck also contains 13 cards of each suit, 
so that the probability of a heart, say, is if = i- This is also 
the probability of a diamond, or a spade, or a club. 

Description of Fundamental Probability Set. It should be
especially noted that in defining a probability the set of objects
to which it pertains must be precisely designated and the property
of an object to which the probability refers must be carefully
distinguished. For example, the probability of an ace in
a pinochle deck¹ is 8/48 = 1/6 and not 4/52 = 1/13, as it is in an
ordinary deck. Furthermore, for the same set of cards, the probability
of a card of a given color is not the same as the probability of a
card of a given suit or of a card of a given value. What is more
important is that each of these properties and hence their
probabilities pertain to a different classification of the objects of the
set. As will be seen later, it is possible to add probabilities
pertaining to the same classification of the objects of a given set,
but not probabilities pertaining to different classifications, even
though the set of objects is the same. A set of objects classified
in a given way is called a “fundamental probability set.” In
all calculations it is very important to define carefully the
fundamental probability set that is involved.
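
A short Python sketch may make the dependence on the fundamental probability set concrete. It assumes a pinochle deck of 48 cards (two each of ace, king, queen, jack, ten, and nine in every suit); the helper names are hypothetical and chosen only for this example.

```python
from fractions import Fraction

SUITS = ["hearts", "diamonds", "spades", "clubs"]
ORDINARY_RANKS = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
PINOCHLE_RANKS = ["A", "K", "Q", "J", "10", "9"]

ordinary = [(rank, suit) for suit in SUITS for rank in ORDINARY_RANKS]
# Two copies of each pinochle card, giving 48 cards in all.
pinochle = [(rank, suit) for suit in SUITS for rank in PINOCHLE_RANKS for _ in range(2)]


def probability(objects, has_property):
    """Relative frequency of objects having the property."""
    return Fraction(sum(1 for obj in objects if has_property(obj)), len(objects))


def is_ace(card):
    return card[0] == "A"


p_ace_ordinary = probability(ordinary, is_ace)  # 4/52
p_ace_pinochle = probability(pinochle, is_ace)  # 8/48

# Within one classification (here, by suit) the probabilities add to 1.
suit_total = sum(probability(ordinary, lambda c, s=s: c[1] == s) for s in SUITS)
print(p_ace_ordinary, p_ace_pinochle, suit_total)  # 1/13 1/6 1
```

The same property (being an ace) yields different probabilities in the two sets, and the probabilities add to unity only when they are taken over a single classification of one set.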

In this connection it should be noted that the “probability of
a heart in an ordinary deck of cards” is not necessarily the same
thing as the “probability of drawing a heart from the deck.”

Foundations of the Theory of Probabilities,” Philosophy of Science (1934), 
Vol. 1, pp. 50-70. 

¹ A pinochle deck consists of 2 aces, 2 kings, 2 queens, 2 jacks, 2 tens, and
2 nines of each suit. There are no cards of lower value.

For the former, the fundamental probability set is precisely
designated; it is simply the given deck of cards classified according
to suit. The total number in the deck may be readily
counted; the hearts may be easily separated from the others;
and their relative frequency, i.e., their probability, may be
directly computed. But what is the fundamental probability
set to which the “probability of drawing a heart from the deck”
pertains? To this there are several answers.

Suppose that 100 drawings are made from the given deck,
the card drawn each time being replaced in the deck and the
whole well shuffled before the next drawing. Let the number
of hearts so drawn be 20. Here the fundamental probability
set is the set of 100 drawings classified according to suit, and
the probability of a heart in this set is 20/100 = 1/5. In this case
also, the total number of objects can be counted and the number
having the given property can be readily ascertained.

The “probability of drawing a heart from the deck” may,
however, pertain to a set of 100 drawings to be made in the
future. Here the total number of “objects” in the set is given,
but there is no way of ascertaining how many of these drawings
will yield hearts. In this case, the “probability of drawing a
heart from the deck” is simply unknown.

Finally, the “probability of drawing a heart from the deck”
may pertain to a set of hypothetical drawings, not actual drawings.
If, it may be argued, 100 drawings should be made from
the deck in the prescribed manner and if 30 of these should be
hearts, then the probability of a heart in this assumed set would
be 30/100. The “probability of drawing a heart from the deck”
refers in this case to a hypothetical set.

Infinite Probability Sets. Frequently, probability theory is 
concerned with an infinite set of objects. These are usually 
hypothetical sets but may in some cases be real sets, such as the 
infinity of points on a line. Without going into mathematical 
refinements, it may be said that the probability of an object 
of a given property in an infinite set is the percentage of such 
objects in the set. For the percentage of a particular kind of 
object in an infinite set may be finite even if the number of 
objects of the given property and the total number of objects 
are both infinite. For example, if a coin is tossed indefinitely, 
both the total number of tossings and the number of tossings
yielding heads may be increased without limit. Nevertheless,
the ratio of the number of heads to the total number of tossings
will stay within finite limits no matter how many tossings are
made. For an infinite set, as for a finite set,
the probability of an object having a particular property is the
relative frequency of such objects in the given set.¹

PROBABILITY AND THE RELATIVE FREQUENCY 
OF ACTUAL EVENTS 

In concluding this chapter a few words should be said about
the relationship between mathematical probability and the
relative frequency of actual events. As defined above, probability
is a constant characterizing a given set of objects; it is
merely a mathematical abstraction. If the theory of probability
is to be of any practical use, however, it must be tied to
the relative frequency of actual events. It must help, in other
words, in making predictions about real life.

The Law of Large Numbers. The link that ties mathematical
probability to the relative frequency of real events is actual
experience with mass phenomena. This experience has been
called the “law of large numbers,” which says that, when a large
number of random events is involved, it is usually possible to
predict, with reasonable accuracy, the relative frequency of
occurrence of a particular event by calculating a certain
mathematical probability. To illustrate, consider once again an
ordinary deck of playing cards. Mathematically, this can be
looked upon as a set of 52 objects for which the probability of a
heart is 13/52 = 1/4. Let a large number of drawings, say 1,000, be
made from this deck, the card drawn each time being replaced
and the deck well shuffled before the next drawing. As already
pointed out, no exact statement about the number of hearts
drawn can be made in advance of the drawings. Experience
shows, however, that in random drawings of this kind the relative
frequency of hearts drawn approximates fairly well the
mathematical probability of a heart in the deck. Hence, in the given
instance it may be predicted that of the 1,000 random drawings
something close to 250 will be hearts.
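
The experience described here can be imitated on a computer. The sketch below is an illustration only: it assumes that Python's pseudo-random generator is an acceptable stand-in for a well-shuffled deck, and the function name `heart_frequency` and the seed value are arbitrary choices made for the example.

```python
import random

SUITS = ["hearts", "diamonds", "spades", "clubs"]
# One entry per card; only the suit matters for this property.
DECK = [suit for suit in SUITS for _ in range(13)]


def heart_frequency(n_draws, seed=0):
    """Relative frequency of hearts in n_draws drawings with replacement."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    hearts = sum(1 for _ in range(n_draws) if rng.choice(DECK) == "hearts")
    return hearts / n_draws


for n in (100, 1_000, 100_000):
    # The frequencies vary from run to run but tend to settle near 13/52 = 0.25.
    print(n, heart_frequency(n))
```

No exact count can be promised for any one run; what the simulation suggests is only that the relative frequency tends to lie near 0.25 when the number of drawings is large.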


¹ For a more refined definition of probability, see Neyman, op. cit., pp.
10-11.

The foregoing is a very simple illustration of the law of large
numbers. The law appears to be equally valid, however, for
more complicated calculations of probability. For example,
suppose there are two decks of cards, one an ordinary deck and
the other a pinochle deck, and suppose that all possible combinations
of two cards are made by combining one card from the
ordinary deck with one card from the pinochle deck. Since the
first card can be picked in 52 ways and the second in 48 ways,
there will be 52 × 48 = 2,496 such combinations. Of these
2,496 combinations, 4 × 8 = 32 will be pairs of aces; hence, in
this set of combinations, 32/2,496 = 1/78 is the probability of a
pair of aces.¹ Now let a very large number of drawings be made
from each deck of cards, the card drawn each time being replaced
and the deck well shuffled before the next drawing. Furthermore,
let the first card drawn from the ordinary deck be paired
with the first card drawn from the pinochle deck, the second
card from the ordinary deck with the second card from the
pinochle deck, etc. Then, if the number of random drawings
is very large, experience shows that the pairs of aces actually
occurring in this large set of drawings will be close to 1/78 times
the total number of drawings. Again the relative frequency of
actual events can be approximately predicted by the computation
of a mathematical probability. In fact, if random mass phenomena
are involved, the whole of the calculus of probability can
be employed in the prediction of relative frequencies with
satisfactory accuracy.
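
Both halves of this illustration, the enumeration of the combinations and the paired random drawings, can be sketched in Python. The code is a hedged illustration: the pinochle deck is assumed to hold 8 cards of each of six ranks, the pseudo-random generator stands in for shuffling, and all names are invented for the example.

```python
import random
from fractions import Fraction
from itertools import product

ORDINARY_RANKS = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
PINOCHLE_RANKS = ["A", "K", "Q", "J", "10", "9"]

ordinary = [rank for rank in ORDINARY_RANKS for _ in range(4)]  # 52 cards
pinochle = [rank for rank in PINOCHLE_RANKS for _ in range(8)]  # 48 cards

# Every combination of one ordinary card with one pinochle card.
pairs = list(product(ordinary, pinochle))
ace_pairs = sum(1 for first, second in pairs if first == "A" and second == "A")
p_ace_pair = Fraction(ace_pairs, len(pairs))
print(len(pairs), ace_pairs, p_ace_pair)  # 2496 32 1/78


def simulated_ace_pair_frequency(n_draws, seed=0):
    """Pair the i-th draw from each deck and count pairs of aces."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        first = rng.choice(ordinary)
        second = rng.choice(pinochle)
        if first == "A" and second == "A":
            hits += 1
    return hits / n_draws


# With many drawings the frequency tends to lie near 1/78 (about 0.0128).
print(simulated_ace_pair_frequency(200_000))
```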

Empirically Determined Probabilities. It might be pointed
out in passing that in many instances the original set of objects 
is not completely known and the probability of a given property 
of the set must be determined empirically. For example, the 
total number of deaths in the United States of white males, age 
fifty, is not completely known. Indeed, so far as we know, 
deaths of men, age fifty, will continue to occur indefinitely. Thus 
of the total number of men who have reached and will reach the 
age of fifty, the number who have died or will die during their 
fiftieth year is not precisely known. On the basis of the law of 
large numbers, however, it seems safe to assume that the many 
vital statistics that have been accumulated give a very close 
approximation to the true probability of death at age fifty. That 

¹ Cf. Chap. X.

this assumption is justified is again verified by actual experience.
Thus, if the empirically determined probability of a man dying
at age fifty and the empirically determined probability of his wife
dying at age fifty are used to calculate¹ the probability of both
a man and his wife dying at the age of fifty, experience with large
masses of data shows that the relative frequency of such pairs
of deaths at age fifty does actually approximate the calculated
probability. The calculus of probability can thus be used by
life-insurance companies with general success. Similar results
have been found true of other empirically determined probabilities.
The law of large numbers thus appears to be universally
valid.²

“Randomness.” It will be noted that the law of large numbers
applies only to mass phenomena that are “random.” This is
very important. If it happened, for example, that, in drawing
pairs of cards from an ordinary deck and a pinochle deck, some
method of selection were used that caused aces to appear in some
cyclical order, say an ace on every tenth draw from the ordinary
deck and on every fifth draw from the pinochle deck, then the
relative number of pairs of aces occurring would not equal the
computed mathematical probability. For, in this case, pairs
of aces would occur on every tenth draw, and the probability
of a pair of aces in the infinite set of drawings would be 1/10 and
not 1/78 as computed above.
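
The effect of such a cyclical scheme can be counted directly. The Python sketch below assumes, for illustration only, that aces appear exactly on the stated draws and on no others; the function name is hypothetical.

```python
from fractions import Fraction


def cyclic_ace_pair_frequency(n_draws):
    """Relative frequency of ace pairs under the cyclical selection scheme."""
    pairs_of_aces = 0
    for i in range(1, n_draws + 1):
        ordinary_is_ace = i % 10 == 0  # ace on every tenth ordinary draw
        pinochle_is_ace = i % 5 == 0   # ace on every fifth pinochle draw
        if ordinary_is_ace and pinochle_is_ace:
            pairs_of_aces += 1
    return Fraction(pairs_of_aces, n_draws)


print(cyclic_ace_pair_frequency(1_000))  # 1/10
print(Fraction(4, 52) * Fraction(8, 48))  # 1/78
```

Every tenth draw is also a fifth draw, so the two cycles coincide and the pairs of aces occur with frequency 1/10, far from the 1/78 obtained for random drawings.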

“Randomness” cannot be exactly defined. Fundamentally,
it is an intuitive concept. General notions suggest that to be
random the occurrence of an event must be related in no way to
its property; e.g., the drawing of an ace must be unrelated to its
being an ace. Nor must a random series of events show any
relationship between the members of the series. In other words,
events must occur in complete disorder; they must be
unpredictable by any formula. But, after all, these are negative

¹ See Chap. X.

² The association between mathematical probability and the relative
frequency of real events is not essentially different from the association
between mathematical models in other sciences and happenings in the real
world. In physics, for example, the closeness of the association is good
enough to enable mathematical formulas to be used in the construction of
bridges, automobiles, and the like. In other words, the justification of the
theory is that it works.

criteria. The positive content of randomness must be left
undefined.¹

APPENDIX 

A REVIEW OF THREE IMPORTANT CONCEPTS OF PROBABILITY 

As pointed out in the main body of this chapter, various concepts of 
probability are admissible. Altogether there appear to be three principal 
concepts that have contended for acceptance by scientists and philosophers. 
These may be described as the “classical concept,” the “frequency concept,”
and the “intuitive-axiomatic approach” to probability. It is this last
that is used in this book. Since it is an outgrowth of the conflict between 
the other two lines of thought, they will be discussed first. 

Classical Concept of Probability. Historical Background. Although 
commercial insurance was practiced by the Babylonians and was well known 
to the Greeks and Romans, the development of a theory of probability, 
such as that on which modern insurance practice is based, dates back only 
to the seventeenth century. Furthermore, it was not in the field of business 
that the seeds of this probability theory were sown, but in the gambling 
rooms of the French gentry. In 1654, Antoine Gombaud, chevalier de
Méré, a French gentleman with an interest in mathematics, called upon the
French mathematician Pascal for the solution of a particular gambling
problem. The ensuing mathematical speculation marked the beginning
of the investigation of games of chance. Subsequently there appeared
various works by Huygens (1657), Jacques Bernoulli (1713), De Moivre
(1718), and Bayes (1764), most of which were concerned with the application
of the theory of permutations and combinations to the calculation of
probabilities associated with various dice and card games.

Meanwhile, French and English experimentalists, mathematical physicists,
and astronomers were concerning themselves with errors of measurements.
Simpson (1757) examined the implications of taking the mean of
a set of astronomical measurements as the best estimate of the true value,
and Lagrange (1770) published a memoir dealing with the “probable
error” of the mean.² Other names associated with the early development
of the theory of errors are Boscovich, Lambert, Euler, Daniel Bernoulli,
and Legendre.³ The development of such concepts as “inverse probability”
and probability of “causes” also led at this time to growing philosophical
speculations on the theory of probability. Furthermore, the collection of 
mortality statistics led to the computation of mortality tables and the 
development of actuarial science. 

All these investigations (the analysis of gambling games, the formulation
of a theory of errors, and philosophical speculation) reached their
culmination in the great work of Laplace, Théorie analytique des probabilités (1812).

¹ See Smith and Duncan, Sampling Statistics, pp. 155-162, for a discussion
of various methods employed to get random samples.

² Cf. Levy, H., and L. Roth, Elements of Probability (1936), pp. 5-6.

³ Cf. Nagel, E., Principles of the Theory of Probability, p. 10.

This master synthesis contains all the essentials of the classical theory of
probability and most of the important deductions from it. From the time
of Laplace, developments of probability theory in the fields of philosophy;
logic; mathematics; physical, chemical, and biological research; and the
social and industrial arts and sciences were all bound to react on each other
and to build on the same broad foundation.¹ In the sense that he thus
fused together the various lines of development, Laplace may be looked upon
as the formulator of the classical theory of probability.

The Classical Concept. The definition of probability given by Laplace
and generally adopted by disciples of the classical school runs as follows:
Probability, it is said, is the ratio of the number of “favorable” cases to the
total number of equally likely cases. For example, if a coin is tossed, there
are two equally likely results, a head or a tail; hence the probability of a
head is 1/2. If a die is thrown, there are six equally likely results, and the
probability of any particular one of these results, say a five, is therefore 1/6.
Or again, if a die is thrown, the probability of obtaining an even number is
3/6 = 1/2, for three of the six equally possible results are even numbers. This
last example illustrates how the classical theory derived the addition
theorem. For it will be noted that the probability of getting a particular
one of the even numbers is in each case 1/6. But it has just been shown that
the probability of any even number is 3/6 = 1/6 + 1/6 + 1/6; hence, the theorem
follows that the probability of any one of a number of mutually exclusive
events is the sum of their individual probabilities.
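
The agreement between counting favorable cases and adding individual probabilities can be verified for the die in a few lines of Python. This is merely a sketch of the classical reasoning; the variable names are assumptions made for the example.

```python
from fractions import Fraction

FACES = [1, 2, 3, 4, 5, 6]

# Classical definition: each of the six equally likely faces has probability 1/6.
p_face = {face: Fraction(1, len(FACES)) for face in FACES}

# Probability of an even number by counting favorable cases.
even_faces = [face for face in FACES if face % 2 == 0]
p_even_by_counting = Fraction(len(even_faces), len(FACES))

# The same probability by adding the mutually exclusive cases 2, 4, and 6.
p_even_by_addition = sum(p_face[face] for face in even_faces)

print(p_even_by_counting, p_even_by_addition)  # 1/2 1/2
```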

Still another example will illustrate how the classical concept led to the
multiplication theorem. Suppose three coins are tossed. Since either
one of two results on the first coin can be combined with either one of two
results on the second coin and any one of these combinations can be combined
with either one of two results on the third coin, there are altogether
2 × 2 × 2 = 8 equally possible results. The number of these eight possible
combinations that have all three heads is 1. Hence, the probability of all
three heads on the tossing of three coins is 1/8. But this is the same as
the product of the individual probabilities of a head on each coin, i.e.,
(1/2)(1/2)(1/2) = 1/8. In general, the probability of the joint occurrence of
independent events² is the product of their individual probabilities. These are
all illustrations of how the classical theory of probability, in line with its
definition of the term, sought in every case to resolve a problem into a set
of equally likely cases and then by the application of combination formulas
to determine the number of “favorable” cases.
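
The three-coin example lends itself to direct enumeration. The following Python sketch lists the eight equally possible results and compares the counted probability with the product of the individual probabilities; the names used are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# All results of tossing three coins: 2 x 2 x 2 = 8 equally possible cases.
outcomes = list(product("HT", repeat=3))

favorable = [outcome for outcome in outcomes if outcome == ("H", "H", "H")]
p_three_heads = Fraction(len(favorable), len(outcomes))

# Multiplication theorem for independent events: (1/2)(1/2)(1/2).
p_product = Fraction(1, 2) ** 3

print(len(outcomes), p_three_heads, p_product)  # 8 1/8 1/8
```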

Criticism of the Classical Concept. There is little criticism of the theorems
built up by the calculus of probabilities on the basis of the classical
definition insofar as they represent merely a set of logical relationships. 
Generally, the same set of relationships can be demonstrated on the basis 
of other definitions of probability. Criticism of the classical concept centers 

¹ Cf. Levy and Roth, op. cit., p. 8.

² The independence of the individual events is necessary for this theorem
to hold true. In the given example, the probability of getting a head on
any one coin is independent of the results obtained on the other two.

rather in the meaning of the results obtained and the adequacy of the theory
for handling problems outside the field of gambling games, as in the
statistical analysis of physical, biological, and economic data.

Meaning of “Equally Likely.” The principal line of attack on the
classical concept is directed against the term “equally likely cases.”
What does “equally likely” mean, it is asked. Is not “equally likely”
merely another way of saying “equally probable,” and in that case is not
the classical definition of probability a circular one, since it defines
probability in terms of itself? To avert criticism, some rule must be laid down for
the determination of “equally likely” or “equally probable” that is
independent of “probable.”¹ What then were the rules of the classicists for
determining equal likelihood?

In the development of the classical theory, two procedures were offered
for the determination of “equally likely” cases. One was the principle
of sufficient reason and the other the principle of indifference, or the principle
of the equal distribution of ignorance, as it was sometimes called. The first
procedure was followed when a person examined all available evidence
relevant to the event in question and noted that this evidence was symmetrical
with reference to the various possible results. For example, after a
thorough examination of a die, including a nice determination of its center
of gravity and the moments of inertia about various sides, it might be concluded
that the investigator had sufficient reason to consider the die perfectly
symmetrical and hence the six possible results equally likely. According
to the principle of indifference, on the other hand, if the investigator knew 
nothing about the die in question, he had no basis for deeming one side of 
the die to be different from any other and could therefore assume them all 
to be equally likely. This second procedure is subject to particularly 
severe criticism and will be discussed first. 

Principle of Indifference. Total ignorance about a thing, it may reasonably
be argued, can scarcely be a source of any knowledge concerning it,
even of that uncertain kind afforded by a probability statement. In other
words, how can something be got out of nothing? As might be expected,
the use of the principle of indifference has frequently led to paradoxical
results. This has been generally true whenever the set of “equally likely”
cases was not discrete but was represented by a continuous variable. Suppose,
for example, it is known that the weight of a certain man is at least
equal to that of his wife but is not more than double her weight. If ignorance
as to the exact ratio of weights is “evenly distributed,” it may be
concluded that any ratio of the man's to the woman's weight lying between 
1 and 2 is as likely as any other within that interval. From this it follows 
that the probability of its lying between 1 and 1.5 is 50 per cent and that the 
probability of its lying between 1.5 and 2 is also 50 per cent. Suppose, 
however, the ratio of the woman's to the man's weight had been taken as 
the variable. The limits would then be 0.5 and 1, and according to the 
principle of indifference all possible values of the ratio lying within this range 
might be deemed equally likely. Then, however, it may be concluded that 

¹ Cf. Nagel, op. cit., p. 46.

the probability of the ratio of the woman’s weight to the man’s weight lying
between 0.5 and 0.75 is just 50 per cent and that the probability of its lying
between 0.75 and 1 is also 50 per cent. But this second result is in
disagreement with the first, for a ratio of the woman’s to the man’s weight
equal to 0.75 is the equivalent of a ratio of the man’s weight to the woman’s
weight equal to 1.33. Thus, according to the second method of distributing
ignorance, the probability is 50 per cent that the man is 1 to 1⅓ times as
heavy as the woman; and, according to the first method, the probability
is 50 per cent that he is 1 to 1½ times as heavy. In general, when the
principle of indifference is employed to determine what is equally likely,
a change in the coordinates used to describe a continuous variation in
possible results frequently affects the values of the computed probabilities.¹
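
The disagreement can be exhibited numerically by computing the probability of one and the same event under the two ways of “distributing ignorance.” The Python sketch below assumes a uniform distribution over the stated interval in each case; the function `uniform_probability` is a hypothetical helper written for this example.

```python
from fractions import Fraction


def uniform_probability(lo, hi, a, b):
    """P(a <= X <= b) for X uniform on [lo, hi], where [a, b] lies inside [lo, hi]."""
    return Fraction(b - a) / Fraction(hi - lo)


# Event: the man is between 1 and 1 1/3 times as heavy as the woman.
# First method: the ratio man/woman is taken as uniform on [1, 2].
p_first = uniform_probability(1, 2, 1, Fraction(4, 3))

# Second method: the ratio woman/man is taken as uniform on [1/2, 1];
# the same event then reads 3/4 <= woman/man <= 1.
p_second = uniform_probability(Fraction(1, 2), 1, Fraction(3, 4), 1)

print(p_first, p_second)  # 1/3 1/2
```

The same event receives a probability of 1/3 under the first description and 1/2 under the second, which is the paradox in compact form.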

Principle of Sufficient Reason. The first method of procedure, viz., that 
of sufficient reason, puts the theory on a more solid basis. It has been 
criticized, however, in respect to its practical applicability. Even after 
the symmetry of a die has been carefully determined, it is still necessary to 
note any lack of bias in the method of throwing it or again in the surface 
on which it rolls. To determine these a priori is a matter more difficult
than determining the symmetry of the die itself. Much greater, however, is the difficulty
of determining equal likelihood when attention is turned from dice, coins, 
and cards to the phenomena of the scientific laboratories and of everyday 
life. An insurance company insures the lives of a thousand men; how can 
it determine a priori whether they are all equally likely to die during the 
year? Some men are tall, others short; some are fat, some thin; some work 
outdoors, others indoors. Is there any possibility of the insurance company
telling, other than from its actual experience with men of various
classes, i.e., a posteriori, whether these individual differences destroy the
equal likelihood of death? Critics of the classical theory answer with
an emphatic “no.” It is easily seen, they say, why the classical theory
developed out of a study of games of chance, for it is in that field alone that
there is any reasonable possibility of determining a priori whether a set of
possible results are “equally likely.” In most other fields it is impossible.²

But why insist on a rational a priori determination of equal likelihood, 
the reader may ask. Why not determine it a posteriori? For example, to 
determine whether a head or a tail is equally likely for a given coin, why 
not toss the coin a large number of times and note whether the number of 
heads is approximately equal to the number of tails? This sounds simple 
enough until it is studied more closely. Then the question immediately
arises: How good an approximation is necessary (how close to 0.5 must the
ratio of heads to total tosses be) before it can be concluded that a head and a tail
are equally likely? On the assumption of equal likelihood, the classical
theory itself explains that in a finite number of tosses any given result is
possible, although of course all results are not equally probable. If N
tosses are made, for example, there are $2^N$ equally possible results,³ and of

¹ Cf. von Mises, Richard, Probability, Statistics, and Truth (1939), pp.
114-115.

² Cf. von Mises, op. cit., pp. 98-110.

³ Cf. p. 274.

these the number that would have r heads would be the number of
combinations of N things taken r at a time,¹ or
$C_r^N = \frac{N!}{r!\,(N-r)!}$. Hence, any value of r/N would be possible
with a probability of $C_r^N \left(\frac{1}{2}\right)^N$. If
N is large, the probability that r/N should deviate considerably from 0.5
is very small and the general reasonableness of the hypothesis of equal
likelihood could be tested by determining the probability of as large a
deviation from 0.5 as is actually found.²
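
The test just described can be sketched in Python. Under the hypothesis of equal likelihood, the probability of exactly r heads in N tosses is the number of combinations of N things taken r at a time, divided by 2 to the power N. The function names below are assumptions made for the example, and the two-sided reading of “as large a deviation” is one reasonable interpretation, not the only one.

```python
from fractions import Fraction
from math import comb


def prob_heads(n, r):
    """Probability of exactly r heads in n tosses under equal likelihood."""
    return Fraction(comb(n, r), 2 ** n)


def prob_deviation_at_least(n, r_observed):
    """Probability that r/n deviates from 0.5 by at least the observed amount."""
    observed = abs(2 * r_observed - n)  # |r - n/2|, doubled to stay in integers
    return sum(
        prob_heads(n, r) for r in range(n + 1) if abs(2 * r - n) >= observed
    )


print(prob_heads(3, 3))                 # 1/8
print(prob_deviation_at_least(10, 10))  # 1/512
print(float(prob_deviation_at_least(100, 60)))
```

A small value for the last quantity would cast doubt on the hypothesis of equal likelihood, though the text goes on to point out that such a test cannot settle the matter with certainty.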

The empirical results therefore do not provide a certainty that a head 
or a tail is equally likely. There is no way of telling for sure whether the 
deviation from an exact value of 0.5 is due merely to chance or whether the 
coin is actually biased. A similar result might have been produced by a
biased coin. In fact, the latter might on occasion produce an exactly equal
number of heads and tails, so that, even if the ratio of r to N is exactly 0.5,
it is not certain that the coin is perfectly unbiased. If, as suggested above,
the hypothesis of equal likelihood is accepted or rejected on the basis of the
probability of getting the given deviation from 0.5, then equal probability
is being determined in a way that is dependent on the concept of probability
and the criticism of circularity in the definition becomes immediately valid.³

A still further criticism is this: Suppose that after a careful examination
of a die it is found that the die is not symmetrical, what then? If a die is
biased, how can the problem be resolved into a set of equally likely cases?
True, if it can be determined that, through careful weighing, balancing,
rotating, etc., the occurrence of an even number is twice as likely as the
occurrence of an odd number, it might be argued that there are nine equally
likely results, one of which is a one, two of which are twos, one of which is a
three, two of which are fours, one of which is a five, and two of which are
sixes, and the various combinatorial formulas might be based on this
assumption. Even if the possibility of making such an a priori determination is
not questioned, there still remains the problem of how to treat the case where
the bias is such that a given result or results is 1.5 or 3.67 or π times as
likely as some other result. Laplace himself attempted a solution of this
problem but failed to obtain a correct answer.⁴ It would seem that the
problem is insoluble on the basis of the classical concept.⁵
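
The device of nine equally likely results can be written out explicitly. The Python sketch below is illustrative only; the list `cases` encodes the assumption that each even face is exactly twice as likely as each odd face.

```python
from fractions import Fraction

# Nine equally likely cases: each even face is entered twice, each odd face once.
cases = [1, 2, 2, 3, 4, 4, 5, 6, 6]

p_face = {face: Fraction(cases.count(face), len(cases)) for face in range(1, 7)}
p_even = sum(p_face[face] for face in (2, 4, 6))
p_odd = sum(p_face[face] for face in (1, 3, 5))

print(p_face[1], p_face[2], p_even, p_odd)  # 1/9 2/9 2/3 1/3
```

The device works here because the bias is a ratio of whole numbers. A bias of π to 1 could not be represented by any finite list of equally likely cases, since π is not a ratio of integers.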

Subjective Character of Classical Concept. Since the foregoing criticisms 
have in a way implied that “probability” was more or less objective in
character, in all fairness to the classical theory it should be pointed out 
that Laplace and most of his followers took a subjective view of the concept. 

¹ Cf. p. 234.

² Cf. pp. 248-249 for a further discussion of this.

³ The “frequency” concept expounded below does not suffer from this
criticism because, according to this concept, probability is defined as the
limit that the ratio of heads to total results approaches as the number of
tosses is indefinitely increased. No circularity arises from the need of
determining equal probability. See pp. 250-251.

⁴ von Mises, op. cit., p. 102.

⁵ Cf. Nagel, op. cit., p. 46.

Probability to them was a “rational degree of belief.” They considered
the word “probability” as meaning a state of mind regarding a given
statement, a future event, or any other thing about which absolute knowledge
was not to be had.¹ It was not made clear, by the classicists, however,
just how subjective their concept of probability was. If it were a
mere measure of degree of (psychological) belief, the theory of probability
became a part of the science of psychology and immediately the
question arose as to how degrees of belief could be added and multiplied,
as called for by the probability calculus. On the other hand, if “rational
degree of belief” was to be interpreted as what every intelligent person
ought to believe under the given circumstances, then the theory assumed
a certain degree of objectivity and all the foregoing criticisms became
applicable with respect to the exact content of this standard of “oughtness.”²
It is the contention of the critics of the classical concept that, if probability
theory is to be of practical use in statistical science, some objective definition
of probability should be adopted. One writer points out that physical
thermodynamics had its starting point in the subjective impressions of
hot and cold but that its development began when an objective method was
used to compare temperatures by means of a column of mercury.³ In the
same way, he concludes, probability should be put upon a physical,
objective basis.

Frequency Concept of Probability. If the reader will toss an ordinary 
coin a large number of times, he will see that the ratio of the number of 
heads to the total number of tosses will be close to 0.5000, and approximately 
the same result will be obtained each time the experiment is repeated. This 
empirical fact, that in mass phenomena the relative frequency of a given 
attribute often appears to approximate a definite constant, is the cornerstone
of the frequency concept of probability. The constant value that
the relative frequency tends to approximate is identified with the
“probability” of the given attribute.

This frequency approach to the theory of probability goes back at least 
to the work of J. Venn on the Logic of Chance published in London in 1886. 
In the present day, a leading exponent of this view is Richard von Mises, 
whose writings⁴ constitute one of the most important formulations of the
frequency theory of probability. The next section will be devoted to an 
exposition of his ideas. A subsequent section will discuss the “intuitive- 
axiomatic” approach, which is similar to von Mises' theory but differs 
somewhat in its logical basis. 

Concept of von Mises. In von Mises' theory, probability is defined only
with reference to what he calls a “collective.” This is an infinitely large 
set of “random” elements that possess certain specified characteristics. 

^ Cf Nagel, op. cit., p. 44. 

* Cf. Naoel, op. cit., p. 46. 

3 VON Mises, op. cit., p. 112. 

* The most important of these are Wakrscheinlichkeitsrechnung (Leipzig, 
1931) and Probabilityj Statistics, and (1939), the latter being a trans¬ 
lation of an earlier (1928) German edition. 
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The sequence of results obtained from an indefinite tossing of a coin or 
throwing of a die, the set of human births running back into the indefinite 
past and projected into the endless future, the sequence of parts turned 
out by the continuous operation of a given manufacturing process^—^all 
these are examples of collectives. If the elements of such sequences take 
on varying attributes, such as heads or tails, male babies and female babies, 
acceptable and unacceptable parts, the first essential characteristic of a^true 
collective is that the relative frequency with which a particular attribute 
occurs shall approach a fixed limit as the number of elements in the collective 
is indefinitely increased. Mathematically this means the following: If r/N 
is the relative frequency of a given attribute among N elements, e.g.j the 
number of heads among N tossings of a coin, and if p is the limit that 
r/N approaches as N is increased indefinitely, then after some point, say 
N = 1,000,000, the difference between r/N and p becomes, and thereafter 
remains, less than an arbitrarily chosen positive quantity e, say 0.005. 
Numerically it means that if r/N is calculated to a given number of decimal 
places, as N is increased a point is finally reached after which further 
increases bring no change in the calculated figures. For example, if the 
ratio of heads to total tossings is calculated to three decimals, then, after 
some number of tossings, this ratio will always give r/N = 0.500. The 
limit that the relative frequency of a given attribute approaches as the 
number of elements of a collective is indefinitely increased is defined as the 
probability of that attribute in the given collective. 
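
The behavior of r/N described above can be illustrated, though not proved, by a small simulation. The sketch below assumes a pseudo-random number generator as a stand-in for an unbiased coin; the function name, the seed, and the checkpoints are illustrative choices and are not part of von Mises' theory. A finite run of this kind is, of course, not a collective.

```python
import random

def relative_frequencies(n_tosses, checkpoints, seed=0):
    """Simulate n_tosses of a fair coin and record r/N at each checkpoint N."""
    rng = random.Random(seed)          # fixed seed so the run can be repeated
    heads = 0
    results = {}
    for n in range(1, n_tosses + 1):
        heads += rng.random() < 0.5    # a True outcome counts as one head
        if n in checkpoints:
            results[n] = heads / n     # relative frequency r/N so far
    return results

freqs = relative_frequencies(100_000, {10, 100, 1_000, 10_000, 100_000})
for n in sorted(freqs):
    print(f"N = {n:>7,}   r/N = {freqs[n]:.3f}")
```

In a typical run the printed ratio wanders for small N and settles near 0.500 as N grows, which is the numerical sense of the limit spoken of in the text.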

The second characteristic of a true collective is its “randomness.” Thus 
the sequence of elements constituting a collective must be free from any 
regularity; they must be in complete disorder. It is to be noted that the 
relative frequency of an attribute may approach a limit in a given sequence 
without that sequence being a random one. If, for example, some special 
apparatus were constructed so that every fifth tossing of a coin resulted in a 
head and every other tossing in a tail, the sequence of results would look 
as follows: 

TTTTHTTTTHTTTTHTTTTHTTTTHTTTTHTTTTHTTTTH .... 

The limit of relative frequency of heads in such a sequence would be 1/5, but 
the sequence is obviously not a random one. It is consequently not a true 
collective, and it cannot be said that the probability of a head under the 
given conditions is 1/5. Actually the probability of a head on every fifth 
tossing is 1, and on every other tossing it is 0. 

What precisely constitutes “randomness”? von Mises' answer to this 
fundamental question is as follows: If subsequences of elements are picked 
from the original sequence in such a way that the selection of a particular 

1 In reality, manufacturing processes change materially from time to time. 
What is envisaged here is one that remains exactly the same indefinitely. 
For a discussion of the statistical aspects of manufacturing processes, see 
W. A. Shewhart, Statistical Method from the Viewpoint of Quality Control 
(Graduate School of the United States Department of Agriculture, 
Washington, 1939). 

element is independent of the attribute assumed by that element and if in 
all possible subsequences of this kind the limit of relative frequency of a 
given attribute is the same as in the original sequence, the latter may be 
said to be random. In the truly random tossing of a coin, for example, the 
selection of every fifth tossing would yield a subsequence of tossings in 
which the relative frequency of heads would approach the same limit (1/2 
if the coin is unbiased) as in the complete sequence. If such a method of 
selection were applied to the particular sequence of heads and tails given 
above, the result would obviously be a subsequence consisting of all heads, 
for which the limit of relative frequency would be 1 and not 1/5 as in the 
original sequence. This sequence clearly fails to meet the test of random¬ 
ness. Another example of randomness is provided by the game of roulette. 
If the results of the game are influenced solely by chance forces, i.e., if they 
constitute a truly random sequence, there is no way of placing bets so as to 
secure better than average results; no formula can be devised to “beat the 
house.” As von Mises puts it, the existence of randomness means the 
impossibility of devising a gambling “system.” 
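
The test of randomness by place selection may be sketched in a few lines. The example below applies the selection "every fifth tossing" to the periodic sequence TTTTH... and to a pseudo-random sequence; the helper names and the use of a seeded pseudo-random generator in place of a true coin are assumptions made for illustration only.

```python
import random

def relative_frequency(seq, attribute="H"):
    """Share of the elements of seq that show the given attribute."""
    return sum(1 for s in seq if s == attribute) / len(seq)

def every_kth(seq, k):
    """Place selection: keep the elements at positions k, 2k, 3k, ... (1-based)."""
    return seq[k - 1::k]

periodic = list("TTTTH" * 20_000)                    # every fifth tossing is a head
rng = random.Random(1)                               # seeded stand-in for a fair coin
tossed = [rng.choice("HT") for _ in range(100_000)]

for name, seq in (("periodic", periodic), ("pseudo-random", tossed)):
    whole = relative_frequency(seq)
    selected = relative_frequency(every_kth(seq, 5))
    print(f"{name:14s} whole = {whole:.3f}   every fifth = {selected:.3f}")
```

For the periodic sequence the selected subsequence consists wholly of heads, so the two frequencies disagree and the sequence fails the test; for the pseudo-random sequence the two frequencies are nearly the same. Passing one such selection does not establish randomness, since the definition requires agreement for every admissible place selection.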

In summary then, a true collective is a mass phenomenon or an endless 
sequence of observations for which (1) the relative frequencies of the par¬ 
ticular attributes of the elements of the collective tend to fixed limits and 
(2) these fixed limits are the same for any place selection of a subsequence, 
i.e., a selection that depends only on the location of an element in the 
collective and not on the attribute it assumes. The existence of such a 
collective is the fundamental postulate of von Mises’ theory of probability. 

Criticism of von Mises’ Concept. Since the initial formulation of his 
theory in 1919,¹ von Mises’ concept of probability has been the subject of 
considerable discussion. Some have sought to refine and elaborate von 
Mises’ views; others have contended that they contain serious logical 
inconsistencies.² Here only a brief mention will be made of these criticisms. 

Since in real life only finite series can be observed, there has been some 
objection to a concept of probability based upon the notion of infinite 
sequences. One writer³ has attempted to work out von Mises’ ideas, using 
finite series, but the complications are great and the results are not so 
comprehensive. After all, the concept of an infinite series only aims to 
give approximate results. It has been a useful mathematical tool in many 
other sciences; so why not in probability?⁴ 

Some writers have attacked the existence of limiting values.⁵ For 
example, in accordance with the classical theory of probability, if an unbiased 
coin is tossed N times, there is always a probability, however small, that the 

1 See Mathematische Zeitschrift, Vol. 5. 

2 See list of references in von Mises’ Probability, Statistics, and Truth, pp. 
316-318, references 35-51. 

3 Blume, Johannes, Zur axiomatischen Grundlegung der 
Wahrscheinlichkeitsrechnung, 1934 (Dissertation Münster); Zeitschrift für 
Physik, Vol. 92 (1934), pp. 232-252; Vol. 94 (1935), pp. 192-203. 

4 Cf. von Mises, Probability, Statistics, and Truth, pp. 121-122. 

5 Cf. Fry, T. C., Probability and Its Engineering Uses, pp. 88-91. 

coin will turn up heads in all, or in a large proportion of, the N tossings. 
It is consequently argued that, whatever point is selected in the infinite 
sequence of tossings, there is always the possibility that the next N tossings 
will turn up such a large proportion of heads that the ratio of heads to total 
tossings will differ from the supposed limit p by more than the arbitrarily 
selected quantity ε and hence contradict the mathematical criterion for the 
existence of a limit.¹ The answer to this criticism is that it is based upon 
another (the classical) concept of probability and is a proposition concerning 
the possible results obtained from a finite number of tossings. It is not 
in contradiction to another proposition that begins with a different view of 
probability and postulates the existence of a limit in an infinite sequence of 
tossings.² 

More serious criticism of von Mises’ theory has been directed against the 
condition of randomness. There is the question, for example, whether a 
series that is in complete disorder and cannot accordingly be described by a 
mathematical formula can logically be conceived to exist. Still further, 
there is the question whether limits to relative frequencies in an infinite 
sequence can coexist with von Mises’ definition of randomness. Recent 
mathematical investigations, however, appear to have resolved this diffi¬ 
culty. It is claimed that, with a more carefully drawn and slightly less 
comprehensive definition of randomness, the type of collective described by 
von Mises can for all practical purposes be conceived to exist.³ 

THE INTUITIVE-AXIOMATIC APPROACH TO PROBABILITY 

The theory of probability presented in the main body of this text is based 
upon the intuitive notion that theorems derived from axioms relating to 
relative frequencies approach satisfactorily the occurrence of events in 
real life. It may therefore be called the “intuitive-axiomatic” approach 
to probability. It is the concept of probability accepted by such men as 
Neyman, Fréchet, and Kolmogoroff.⁴ Lying between the classical concept 
of Laplace and the pure frequency theory of von Mises, it may be called 
a “compromise” concept. 

1 See p. 248. 

2 Cf. von Mises, Probability, Statistics, and Truth, pp. 125-128, and the 
whole of the fourth lecture. Also, see Nagel, op. cit., pp. 37. 

3 The fundamental papers supporting this claim are those of A. H. 
Copeland, American Journal of Mathematics, Vol. 50, pp. 535-552; Vol. 51, pp. 
612-618; Vol. 53, pp. 153-162; and Vol. 58, pp. 181-192, and a paper by 
A. Wald, Ergebnisse eines mathematischen Kolloquiums, Wien, No. 8, pp. 
38-72. See, however, the recent criticism of Maurice Fréchet, Journal of 
Unified Science, Vol. 8, pp. 1-22. He refers to an example by Ville in 
which a given set is so defined that it meets all the conditions laid down by 
Wald and yet contains the regularity that the relative frequency always 
converges to its limit by values that are greater than p. He admits, 
however, that von Mises does not feel that this creates any difficulty in his 
theory. 

4 See footnote (2), p. 236. 





The intuitive-axiomatic approach to probability differs from the classical 
theory in that it avoids the circularity of “equal likelihood.” It merely 
defines probability as a relative frequency of a certain attribute in a given 
set of objects without any statement as to whether these are “equally 
likely.” It consequently avoids introducing any subjective elements into 
the definition of probability.¹ 

On the other hand, the intuitive-axiomatic approach differs from von 
Mises' approach in that it does not identify probability with the mathe¬ 
matical limit approached by the relative frequency of a given attribute in 
an infinite random sequence of actual events. It may indeed take the 
relative frequency of a certain attribute in a hypothetical infinite set as the 
probability of that attribute, but this need not be the mathematical limit 
approached by the relative frequency of any actual type of event. It 
merely says that, if relative frequencies of random mass data are treated as 
if they were mathematical probabilities of some hypothetical infinite set, 
then the calculations derived on this assumption will be satisfactorily close 
to the relative frequencies of various combinations of these data. “Ran¬ 
domness” is left as an undefined, intuitive concept. 

1 The subjective elements and even the concept of equal likelihood enter 
to some extent, however, in the determination of “randomness” (see pp. 241- 
242). 

PROBABILITY DISTRIBUTIONS 


A description of a fundamental probability set giving the 
various categories, or classes, into which the members of the set 
are grouped, together with the probabilities of each group, is 
called a probability distribution. The property of a member of a 
given group may be spoken of as an “attribute,” and a probability 
distribution may be said to show how the total probability 
is distributed among the various attributes. Since the attributes 
of a fundamental probability set are necessarily mutually exclusive 
and since a member of a set must possess some one of the 
given attributes, it follows that the total probability, i.e., the 
sum of the probabilities of the various attributes, must equal 1. 
In other words, the percentages (probabilities) of the cases falling 
in the various groups must add up to 100 per cent. 

For example, the distribution of probability among the four 
suits of an ordinary playing card deck is 


Spades      1/4   or   25 per cent 
Hearts      1/4   or   25 per cent 
Diamonds    1/4   or   25 per cent 
Clubs       1/4   or   25 per cent 


The quality of being a spade, heart, diamond, or club is the 
attribute of a card. Since a card cannot be both a spade and a 
heart, say, these attributes are mutually exclusive; and since a 
card must belong to one of the four suits, the total probability is 1. 

Similarly, the distribution of probability among the six faces 
of an ordinary die is 

1    1/6 
2    1/6 
3    1/6 
4    1/6 
5    1/6 
6    1/6 

The quality of being a 1, 2, 3, 4, 5, or 6 is the attribute of a face. 
Since a face of a die cannot be both a 2 and a 6, say, these attri¬ 
butes are mutually exclusive, and since a face must have one 
of the markings listed above, the total probability is again 1. 
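
The requirement that the total probability equal 1 is easily checked for both examples. The following is a minimal sketch using exact fractions; the dictionary layout and the function name are illustrative conveniences only.

```python
from fractions import Fraction

# Each distribution maps an attribute to its probability.
suits = {suit: Fraction(1, 4) for suit in ("Spades", "Hearts", "Diamonds", "Clubs")}
die = {face: Fraction(1, 6) for face in range(1, 7)}

def total_probability(distribution):
    """Sum of the probabilities of all the mutually exclusive attributes."""
    return sum(distribution.values())

print(total_probability(suits), total_probability(die))
```

Because the attributes are mutually exclusive and exhaust the set, both totals come to exactly 1, or 100 per cent.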

Discrete Probability Distributions. When the attributes of a 
set are qualitative in character, such as spades or hearts in the 
case of a deck of cards or heads or tails in the case of coins, or 
when they are represented by a set of numerical values that do 
not vary continuously, such as the number of spots on the face 
of a die, the distribution of probability is said to be “discrete.” 
If the attributes are represented by points on a horizontal axis 
and their probabilities measured along a vertical axis, a discrete 



Fig. 77. Distribution of probability of heads and tails on a coin. 

Fig. 78. Distribution of probability of the spots on the faces of a die 
(horizontal axis: spots on face of die). 


probability distribution may be pictured by a series of lines or 
bars as in Figs. 77 and 78. It will be noted that it is the height 
of the bar in each case that measures the probability of the 
attribute at which it is erected. 

Continuous Probability Distributions. If the members of a 
set consist of the numerical figures obtained by the repeated 
measurement of the length of a given table or the continued 
measurement of the heights of adult white males, living now and 
in the future, the attributes assumed by the members may form a 
continuous variable. In such a case, the total probability of 
1 can be considered to be distributed over the whole range 
of variation; it will thus form a “continuous” distribution of 
probability. More exactly, the range may be divided into small 
class intervals, and location within one of these intervals may be 
taken as the attribute of a member of the set. In this instance, 
the probability of a member belonging to a given interval may be 
represented by the area of a rectangle erected over that interval, 





and the total distribution may be pictured as a set of such rec¬ 
tangles in the manner shown in Fig. 79. If, now, the class inter¬ 
vals are made smaller and smaller, the tops of these rectangles 
will tend to sketch out a smooth curve (cf. Fig. 80). A probability 
curve of this sort can be looked upon as the limit approached as the 
class intervals into which the range is divided are made infinitesimally 
small. 
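
The passage from rectangles to a smooth curve can be followed numerically. The sketch below assumes, purely for illustration, that heights follow a bell-shaped curve with a mean of 70 inches and a standard deviation of 2 inches; these figures, the function names, and the choice of curve are assumptions and are not taken from the text. For each class interval the height of the rectangle is the probability of the interval divided by its width.

```python
import math

MEAN, SD = 70.0, 2.0   # assumed values, in inches, for illustration only

def cdf(x):
    """Probability of a height below x under the assumed distribution."""
    return 0.5 * (1.0 + math.erf((x - MEAN) / (SD * math.sqrt(2.0))))

def density(x):
    """Ordinate of the smooth curve that the rectangle tops approach."""
    z = (x - MEAN) / SD
    return math.exp(-0.5 * z * z) / (SD * math.sqrt(2.0 * math.pi))

def rectangle_height(lower, width):
    """Height of the rectangle whose area equals the probability of the interval."""
    return (cdf(lower + width) - cdf(lower)) / width

for width in (2.0, 1.0, 0.5, 0.1, 0.01):
    lower = 71.0 - width / 2.0          # class interval centred on 71 inches
    print(f"width {width:5.2f}   rectangle {rectangle_height(lower, width):.6f}"
          f"   curve {density(71.0):.6f}")
```

As the class interval narrows, the rectangle height over 71 inches draws steadily closer to the ordinate of the curve at that point.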

Frequency Distributions as Probability Distributions. From the 
definition of probability given above, it follows that any frequency 
distribution in which the frequencies are expressed as a percentage of 
the total number of cases is a distribution of probability of the given 
set of cases. It likewise follows that a frequency curve that represents 
the distribution of relative frequency of an infinite population of 
cases¹ is also a probability curve. 


Fig. 79. A continuous distribution of probability (horizontal axis: 
inches, 64 to 76). 

Fig. 80. A probability curve (horizontal axis: inches). 


Since a distribution of relative frequency and a distribution of 
probability are thus one and the same thing, all the measures of 
the various characteristics of frequency distributions automati¬ 
cally apply to probability distributions. Thus a probability 
distribution has a mean, a standard deviation, a coefficient of 
skewness, and a coefficient of kurtosis, like any frequency 
distribution. 
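
As an illustration of this point, the mean and standard deviation of the probability distribution of the faces of a die may be computed exactly as for a frequency distribution, with probabilities taking the place of relative frequencies. The function names below are illustrative only.

```python
import math

def mean(distribution):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in distribution.items())

def standard_deviation(distribution):
    """Root of the probability-weighted mean squared deviation from the mean."""
    m = mean(distribution)
    return math.sqrt(sum(p * (x - m) ** 2 for x, p in distribution.items()))

die = {face: 1 / 6 for face in range(1, 7)}
print(mean(die), standard_deviation(die))
```

The mean is 3.5 spots and the standard deviation is the square root of 35/12, or about 1.71 spots.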


1 See p. 238. 

ALGEBRAIC AND GRAPHIC REPRESENTATION OF THE NORMAL 
FREQUENCY CURVE 

Functional Relationships. Before entering upon a mathematical 
and graphic representation of frequency and probability 
curves, it may be well to review briefly the algebraic and geo¬ 
metric description of simple functional relationships. This is 
the purpose of the present section. 

If one quantity varies when a second varies, the first is said to 
be a “function” of the second. The pressure of air in an automobile 
tire, for example, varies with the temperature; pressure is 
thus a function of temperature. Again, the quantity of butter 
bought varies with its price; hence, the purchase of butter is a 
function of its price. 

Functional relationships of this kind are described symbolically 
by such expressions as $y = f(x)$, $y = F(x)$, $y = G(x)$, $y = \varphi(x)$, 
and $y = \psi(x)$, all of which are to be read “y is a function of x” 
or, more specifically, “y is the f function of x,” “y is the F function 
of x,” etc. The expressions $y = f(x)$ and $y = F(x)$ are the 
most common; the others are often used, however, when a 
problem involves more than one functional relationship. For 
example, if y and z are both functions of x, this may be expressed 
by $y = f(x)$ and $z = g(x)$. 

Frequently a quantity varies, not merely with one, but with a 
number of other quantities. The former may then be said to be 
a “joint function” of the latter. Thus the volume of gas in a 
tube is a function of the pressure and the temperature (Boyle’s 
law); the quantity of butter bought is a function of the price of 
butter and the income of its purchasers. Joint functional relationships 
of this kind are expressed by $y = f(x, z)$, $y = F(x, z)$, 
$y = \varphi(x, z)$, etc., all to be read “y is a function of x and z.” 

Explicit and Implicit Functions. The functional relationships 
so far considered are “explicit” functions. In explicit functions 
one variable is selected as the dependent variable, and the other 
or others as the independent variables; this is indicated by writing 
the dependent variable to the left of the equal sign. Often, 
however, it is convenient to talk of two variables as being func¬ 
tionally related without indicating which is to be taken as 
dependent and which as independent variable. Such a functional 
relationship is indicated by $f(x, y) = 0$, $F(x, y) = 0$, 
$\varphi(x, y) = 0$, etc., or if there are more than two variables, by 
$f(x, y, z) = 0$, $F(x, y, z) = 0$, $\varphi(x, y, z) = 0$, etc. These all mean 
that x and y or x, y, and z are “functionally related.” Functions 
of this kind are called “implicit” functions. An explicit function 
can often (although not always) be derived from an implicit 
function by merely solving the latter for the variable selected as 
dependent. 

The simplest type of functional relationship is expressed by a 
polynomial. A polynomial in x means such expressions as $x$ 
(strictly speaking a monomial), $a + x$, $a + bx$, and $a + bx + cx^2$. 
The “degree” of the polynomial is the highest power of x that 
occurs in the expression. Thus $a + bx$ is a polynomial of the 
first degree; $a + bx + cx^2$ and $a + cx^2$ are polynomials of the 
second degree; and $a + bx + cx^2 + dx^3$, $a + bx + dx^3$, 
$a + cx^2 + dx^3$, and $a + dx^3$ are polynomials of the third degree. 
Polynomials in two variables, say x and z, are illustrated by 
$a + bx + gz$, $a + bx + cx^2 + gz + hz^2$, $a + bx + gz + mxz$, and 
$a + bx + kz^3$, the first being of the first degree, the second and 
third of the second degree, and the last of the third degree. 

If y varies by a constant absolute amount every time x varies 
by a fixed given amount, the function that expresses this relationship 
is a first-degree polynomial in x, such as $y = a + bx$. 
For every time x increases (or decreases) by one unit, y increases 
(or decreases) by b units. Thus, if $y = 10 + 2x$, y increases (or 
decreases) by two units every time x increases (or decreases) by 
one unit. If b is negative, then the variation in y is opposite in 
direction to that of x. For example, if $y = 10 - 2x$, then y 
decreases (or increases) by two units every time x increases (or 
decreases) by one unit. The quantity a is the value of y when 
x equals 0; if $y = 0$ when $x = 0$, then a must be zero. 

When a functional relationship can be represented in this way 
by a first-degree polynomial, such as $y = a + bx$, then y is said 
to be a “linear” function of x. This continues to hold true 
when there is more than one independent variable. Thus 
$y = a + bx + gz$ is said to express y as a linear function of 
x and z. If the change in y that accompanies a given change in x 
varies with the value of x, then the functional relationship can 
no longer be expressed by a first-degree polynomial in x. The 
function is in this case “nonlinear,” and a higher degree polynomial 
or some more complex expression must be employed to 
indicate the relationship. A nonlinear relationship between y 
and two or more variables, such as x and z, must also be expressed 
by some function other than a first-degree polynomial in these 
two variables. 

Graphs of Simple Functions. A pair of values, x and y, may 
be represented by a point in a plane. The point P in Fig. 81, 
for example, represents the value $x = x_0$ and $y = y_0$; i.e., the 
coordinates of P are $(x_0, y_0)$. 

First-degree Polynomials. The graph of a first-degree polynomial 
is a straight line (hence the name “linear” relationship), 



Fig. 81. Plotting of a point. 

Fig. 82. Graph of a straight line. 


and conversely every straight line may be represented algebraically 
by a polynomial of the first degree. The simplest way to 
comprehend this relationship between the algebraic and the 
geometric presentation is to think of a straight line in reference 
to (1) the angle it makes with the x-axis and (2) the intercept 
it cuts on the y-axis. Thus, in Fig. 82, let θ represent the angle 
CAB that the line makes with the x-axis at A. The tangent of 
this angle is the slope of the line AB. Let b represent this slope; 
then $b = \tan\theta$. It is evident that a straight line is determined 
when its slope b and its y intercept, a (or OB), are found, as follows: 
Let P (whose coordinates are any x, y) be a representative point 
of the line. Take PC, the perpendicular to Ox, from P, and 
BD, the perpendicular to PC, from B. Then, 

$$\tan DBP = \frac{DP}{BD}$$

or 

$$DP = \tan DBP \times BD \qquad (1)$$

But 

$$DP = CP - CD = CP - OB = y - a$$

Also, 

$$BD = OC = x \quad \text{and} \quad \tan DBP = \tan CAP \quad \text{(similar triangles)}$$

and 

$$\tan CAP = \tan\theta = b$$

Now, since $DP = y - a$, and $\tan DBP = \tan\theta = b$, and $BD = x$, 
then, upon substituting in Eq. (1), 

$$y - a = bx \quad \text{or} \quad y = a + bx \qquad (2)$$



Fig. 83. Graph of a parabola, $y = cx^2$ ($c > 0$). 


A straight line thus represents y as equal to a polynomial in x 
of the first degree. When the line passes through the origin, 
a = 0 and the functional relationship becomes simply y = bx. 

Second-degree Polynomials. If the functional relationship 
between x and y takes the form of a second-degree polynomial, 
such as $y = a + bx + cx^2$, its graph will be a parabola. By 
definition, a parabola is the locus of all points equidistant from a 
fixed point called the “focus” and a fixed line called the “directrix.” 
In Fig. 83, the focus is the point (F), and the directrix is the line 
($y + F = 0$). Take any point on the parabola, $P(x', y')$, and 
draw from it a line perpendicular to ES, at point M; then draw 
a line from the focus (F) to the point P. 

At the point $x = 0$, $y = 0$, it is obvious that the parabola is 
equidistant from the directrix $y + F = 0$, or $y = -F$, and the 
focus at $x = 0$, $y = F$. 

Now, if a line were drawn from P perpendicular to the y-axis, 
at C, it is clear that $FP^2 = (y' - F)^2 + (x')^2$, since $FC = y' - F$ 
and $PC = x'$. This is true since the square of the hypotenuse is 
equal to the sum of the squares of the other two sides of a 
right-angled triangle. Furthermore, it can be seen from Fig. 83 that 
$MP^2 = (y' + F)^2$ because MP is $y' + F$. By hypothesis, 
$MP^2 = FP^2$ and hence, by substitution, 

$$(y' + F)^2 = (y' - F)^2 + (x')^2$$

which is true for any value of $x'$ and $y'$; that is, any x and y, and 
which by transposition and simplification becomes 

$$x^2 = 4Fy \quad \text{or} \quad y = \frac{x^2}{4F}$$

This is of the general form of $y = cx^2$. If the curve had not 
passed through the origin, its equation would have been of the 
form $y = a + cx^2$; and if its vertex had come either to the right 
or left of the y-axis, its equation would have been 

$$y = a + bx + cx^2 \qquad (3)$$

If the value of c is negative, the curve turns down as in Fig. 84 
instead of up as in Fig. 83. These parabolic curves thus illustrate 
the form of a functional relationship in which y is set equal 
to a second-degree polynomial in x. 
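
The focus and directrix property can be checked numerically for the parabola through the origin, whose equation in this setting is y = x²/4F with focus at (0, F) and directrix y = -F. The sketch below is only a check on a few points; the value F = 1.5 and the function names are arbitrary illustrative choices.

```python
import math

def parabola_y(x, F):
    """Ordinate of the parabola y = x^2 / (4F) at abscissa x."""
    return x * x / (4.0 * F)

def distances(x, F):
    """Return (FP, MP): distance to the focus (0, F) and to the directrix y = -F."""
    y = parabola_y(x, F)
    to_focus = math.hypot(x, y - F)   # hypotenuse with sides x and y - F
    to_directrix = y + F              # vertical distance down to the line y = -F
    return to_focus, to_directrix

for x in (-3.0, 0.0, 0.5, 2.0, 10.0):
    fp, mp = distances(x, F=1.5)
    print(f"x = {x:5.1f}   FP = {fp:.6f}   MP = {mp:.6f}")
```

For every point tried the two distances agree, as the definition of the parabola requires.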



Fig. 84. Graph of a parabola, $y = -cx^2$ ($c > 0$). 

Fig. 85. Graph of a circle, $x^2 + y^2 = r^2$. 


The Circle. The implicit functional relationship 

$$x^2 + y^2 - r^2 = 0 \qquad (4)$$

is of interest in that its graph is a circle. By definition, a circle 
is the locus of all points equidistant from a fixed point called the 
“center.” In Fig. 85, the center is taken at the origin. By the 
property of right-angle triangles, the square of the distance of 
any point P from the center at O (the origin) is simply $x^2 + y^2$. 
Since by definition this must be the same for all points on the 
circle, the equation of the circle is simply $x^2 + y^2 = r^2$, where 
r is the radius. By transposition, this becomes $x^2 + y^2 - r^2 = 0$. 

If the center of the circle had been at the point $(a, b)$ instead of 
at the origin, its equation would have been 

$$(x - a)^2 + (y - b)^2 - r^2 = 0$$

An implicit functional relationship, therefore, in which two 
variables enter to the second degree with identical coefficients 
but in which there is no cross-product term (such as xy) is simply 
the algebraic expression for a circle. 

The Ellipse. If the functional relationship represented by a 
circle is modified so that the coefficients of the second-degree 
terms are no longer identical (but still retain the same signs), its 
geometric counterpart is distorted so as to become an ellipse. 
Thus the graph of the implicit functional relationship 


$$ax^2 + by^2 - r^2 = 0 \qquad (5)$$

is an ellipse whose semimajor axis is $r/\sqrt{a}$ and whose semiminor 
axis is $r/\sqrt{b}$ (cf. Fig. 86). If b is less than a, then the ellipse 
runs the other way, as in Fig. 87. 



Fig. 86. Graph of an ellipse, $ax^2 + by^2 - r^2 = 0$ ($a < b$). 

Fig. 87. Graph of an ellipse, $ax^2 + by^2 - r^2 = 0$ ($a > b$). 


It will be noted that, if either of the implicit relationships 
$x^2 + y^2 - r^2 = 0$ or $ax^2 + by^2 - r^2 = 0$ is solved for y, the 
resulting solution gives y, not as a single-valued function of x, but 
as a double-valued function. Thus, in the case of the circle, 
$y = \pm\sqrt{r^2 - x^2}$, which shows that, for each value of x, there are 
two values of y. Likewise, for the ellipse, 
$y = \pm\sqrt{(r^2 - ax^2)/b}$, 
which again gives two values of y for each value of x. Geometrically 
this means that a line perpendicular to the x-axis 
cuts the circle and ellipse at two different points. In contrast, 
the straight line and the parabola express y as a single-valued 
function of x. The difference is due to the fact that, in the case 
of the circle and ellipse, it is $y^2$ and not y that is expressed as a 
polynomial in x and when the square root is taken two values of 
y result. 
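
The double-valued character of these solutions may be seen by solving for y at a given x. The following sketch assumes that x lies within the curve, so that the quantity under the radical is not negative; the function names and the sample constants are illustrative only.

```python
import math

def circle_y(x, r):
    """Both values of y on the circle x^2 + y^2 - r^2 = 0 at abscissa x."""
    root = math.sqrt(r * r - x * x)
    return root, -root

def ellipse_y(x, a, b, r):
    """Both values of y on the ellipse a x^2 + b y^2 - r^2 = 0 at abscissa x."""
    root = math.sqrt((r * r - a * x * x) / b)
    return root, -root

print(circle_y(3.0, 5.0))
print(ellipse_y(1.0, a=1.0, b=4.0, r=2.0))
```

Each call returns a positive and a negative ordinate, the two points at which a perpendicular to the x-axis cuts the curve.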





Rising Exponential Function. A simple function that does not 
express y or y² as a polynomial in x is the exponential, or 
“compound-interest,” function. Suppose that a sum of $1 is put 
out at interest at 6 per cent per year compounded for 3 years; 
then the value of this sum at the end of the 3 years would be 
as follows: 

$$\text{Value at end of 3 years} = (1 + 0.06)(1 + 0.06)(1 + 0.06) = (1 + 0.06)^3$$

In general, if $1 is put out at interest for x years and compounded 
at the rate of r per annum, its value at the end of the x years 
would be as follows: 

$$\text{Value at end of } x \text{ years} = (1 + r)^x$$

If the interest is compounded every 6 months instead of every 
year, then the value at the end of x years would be as follows: 

$$\text{Value at end of } x \text{ years} = \left(1 + \frac{r}{2}\right)^{2x}$$

for the interest rate for 6 months is just half the rate for a year 
and the number of 6-month periods is just double the number of 
years. If the interest is compounded every quarter, then the 
value of the $1 at the end of x years would be 

$$\left(1 + \frac{r}{4}\right)^{4x}$$

and if the interest is compounded every 1/nth of a year, the value at 
the end of x years would be 

$$\left(1 + \frac{r}{n}\right)^{nx}$$

r always being the rate of simple interest per year. This last may be written 

$$\left[\left(1 + \frac{r}{n}\right)^{n/r}\right]^{rx}$$

If, now, n is made infinitely large, in other words, if the period of 
compounding (that is, 1/nth of a year) is made infinitesimally 
small, so that the operation of compounding may be viewed as 
practically continuous, then the quantity 

$$\left(1 + \frac{r}{n}\right)^{n/r}$$

is equal approximately to e, the base of the Napierian system of 
logarithms. For, by definition, e is the limit of 

$$\left(1 + \frac{1}{m}\right)^{m}$$

as m approaches infinity. The value of $1 at the end of x years when 
interest at r per annum is compounded continuously is therefore 

$$e^{rx}$$

The quantity e, it will be recalled, is a numerical constant 
equal approximately¹ to 2.7183, so that the value of a sum 
compounded continuously for x years is given by a function of 
the form 

$$y = ae^{bx}$$

where a and b are merely constants. The 
compound-interest function is thus an “exponential” function, 



Fig. 88. Graph of a rising exponential function, $y = e^{rx}$ ($r > 0$). 

Fig. 89. Graph of a declining exponential function, $y = e^{-rx}$ ($r > 0$). 


for the variable x enters into the function as an exponent. A 
graph of this function for a positive value of b (= r) is shown in 
Fig. 88. 
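
The approach of the compounded value to the exponential as the compounding period shrinks can be followed numerically. The sketch below uses the 6 per cent rate and 3-year term of the example above; the function names and the particular values of n are illustrative choices.

```python
import math

def compounded_value(r, x, n):
    """Value of 1 dollar after x years at annual rate r, compounded n times a year."""
    return (1.0 + r / n) ** (n * x)

def continuous_value(r, x):
    """Value of 1 dollar after x years at rate r, compounded continuously."""
    return math.exp(r * x)

r, x = 0.06, 3
for n in (1, 2, 4, 12, 365, 1_000_000):
    print(f"n = {n:>9,}   value = {compounded_value(r, x, n):.6f}")
print(f"continuous    value = {continuous_value(r, x):.6f}")
```

The value rises with each shortening of the period but by smaller and smaller amounts, and for very large n it is practically indistinguishable from the continuous figure.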

Declining Exponential Function. The “present value” of $1 
due x years hence, interest being compounded continuously at 
the rate of r per annum, would be equal to 

$$e^{-rx}$$

For since the sum of $1 would accumulate to 

$$e^{rx}$$

dollars at the end of x years, the sum of 

$$e^{-rx} = \frac{1}{e^{rx}}$$

dollars would accumulate to 

$$e^{-rx} \cdot e^{rx} = e^{0} = 1 \text{ dollar}$$

at the end of that time, and hence the present value of $1 due x 
years hence is the negative exponential given above. This “discount” 
function, as it may be called, is thus a negative exponential function. 
Its graph is shown in Fig. 89. 

1 This may be easily proved by putting increasingly large values of m in 
the formula e = (1 + 1/m)^m. Thus, 

for m = 10, e = (1 + 1/10)^10 = (1.1)^10 = 2.594; 

for m = 50, e = (1 + 1/50)^50 = (1.02)^50 = 2.691; 

for m = 1,000, e = (1 + 1/1,000)^1,000 = (1.001)^1,000 = 2.7171; 

for m = 10,000, e = (1 + 1/10,000)^10,000 = (1.0001)^10,000 = 2.7182, the 
calculations being carried out by logarithms (see Appendix, Table I). 

Whereas the exponential functions are not themselves polynomials, 
it will be noted that the expression that constitutes the 
“exponent of e” is a first-degree polynomial in x (more accurately, 
the monomial x). This suggests that interesting modifications 
of the exponential function might be obtained by making the 
exponent of e a second or higher degree polynomial in x. A function 
of this kind that is of very great importance in the theory of 
frequency curves is $y = e^{-x^2}$. Figure 90 shows how well this 
function represents the general shape of a frequency distribution; 
as will be seen shortly, it is the kernel of the formula for the 
normal frequency curve. 
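
A few ordinates suffice to show the shape of this function. The sketch below takes the function to be y = e^(-x²), which is an assumption based on its description as the kernel of the normal curve; the function name is illustrative.

```python
import math

def kernel(x):
    """Ordinate of y = e^(-x^2): greatest at x = 0 and falling off on both sides."""
    return math.exp(-(x * x))

for x in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"x = {x:5.1f}   y = {kernel(x):.4f}")
```

The ordinates are the same at equal distances on either side of zero and decline rapidly toward zero without ever reaching it, which gives the familiar bell shape.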

Formula for the Normal Frequency and Probability Curves. 
Formulas for frequency and probability curves may be written 
in two ways. The first, which may be symbolized by $y = \varphi(X)$, 
merely expresses the ordinate of the curve y as a function of the 
attribute X whose frequency is being measured. (φ is used here 
to signify “function of,” instead of the usual F or f, in order to 
avoid confusion with the employment of the latter to indicate 
frequencies.) In this form the formula simply describes the 
locus of the points that constitute the frequency or probability 
curve. 

The second method of writing a frequency or probability 
formula is $d(F/N)$ or $dP = \varphi(X)\,dX$. In this form, the 
probability or relative frequency of a case lying between X and 
X + dX is expressed as a function of the attribute X. It will 
be recalled that a probability or frequency curve is the limit 
approached by an area histogram as the class interval is made 
infinitesimally small. The expression $d(F/N)$ or $dP = \varphi(X)\,dX$ 
merely says that, when the class interval (of size dX)¹ is made 
infinitesimally small, the area under the curve for any class 
interval [that is, d(F/N) or dP] is approximately equal to the 
area of a small rectangle whose base is dX and whose height is 
the ordinate of the curve φ(X) at X (cf. Fig. 91). This second 
method is, strictly speaking, the proper method for describing a 
“probability” or “frequency” curve, since it is this and not the 
former which actually expresses the probability or frequency of a given 
class interval as a function of the attribute X. The former gives 
merely an algebraic expression for the curve and not the area under 
the curve. 

1 The letters dX, dP, d(F/N), etc., are to be read as a single symbol and 
not as the product d and X or d and P, etc. The symbol dX means an 

Fig. 91. Graphical representation of a probability function (the area 
under the curve is given approximately by the rectangle). 

The Normal Curve. One curve occurs very often in statistical 
analysis, especially in the theory of sampling. This is the normal 
frequency curve. Its algebraic and graphic representation may 
profitably be illustrated by a brief description. 

The mathematical formula for the normal curve is 


Y-y=<pfx) 


Fig. 91.—Graphical representa¬ 
tion of a probability function. 


0 2.^ dX 


where «(= 2.7183+) is the base of the Napierian system of 
logarithms, X is the mean of the distribution, and o- is its stand¬ 
ard deviation. Pictures of the curve are given in Figs. 92a and 
926. It will be noted that the curve is symmetrical and gen¬ 
erally bell-shaped. 

Owing to its symmetry, the center of the curve comes at the 
mean point X = X. Here the curve also reaches its greatest 

height, viz., -and from there it slopes gradually downward 

a v 27r 

on each side as the factor e assumes greater and greater 

infinitesimally small part of the X range, the symbol dP means an infinitesi¬ 
mal probability, and the symbol d{F/N) means an infinitesimal relative 
frequency. 

^ The first method of description is, however, the only form that is appor- 
priate for a discrete distribution. 
significance.¹ The points of inflection, i.e., the points at which
the left and right branches change from concave to convex
downward, fall at $X = \bar{X} \pm \sigma$. About two-thirds of the
area under the curve lies between these two points of inflection,
and about 95 per cent between $\bar{X} - 2\sigma$ and
$\bar{X} + 2\sigma$. The average deviation equals 0.7979 times the
standard deviation, and the fourth moment equals three times the
square of the second moment (hence $\beta_2 = 3$).
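
The properties just listed can be checked numerically. The following is a minimal Python sketch and not part of the original text: the function names and the illustrative mean and standard deviation are assumptions made for the example, and the areas are obtained from the error function rather than from a printed table.

```python
import math

def normal_ordinate(x, mean, sd):
    """Height of the normal curve at x (the factor that multiplies dX)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def area_within(k):
    """Proportion of the area lying between mean - k*sd and mean + k*sd."""
    return math.erf(k / math.sqrt(2.0))

mean, sd = 4.0, 1.5                      # illustrative values (assumed)
peak = normal_ordinate(mean, mean, sd)   # greatest height, at X = mean

# For a normal curve the average deviation is sd * sqrt(2/pi).
mean_deviation_ratio = math.sqrt(2.0 / math.pi)

print(round(peak * sd, 4))               # 0.3989, i.e. 1/sqrt(2*pi)
print(round(area_within(1), 4))          # 0.6827, "about two-thirds"
print(round(area_within(2), 4))          # 0.9545, "about 95 per cent"
print(round(mean_deviation_ratio, 4))    # 0.7979
```

The curve is also lower at any point away from the mean than at the mean itself, which can be confirmed by comparing `normal_ordinate(mean + sd, mean, sd)` with `peak`.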



Fig. 92a.—Graphs of normal frequency curves with different means but same 
standard deviations. 


As will be noted from the formula, a particular normal curve
is determined when its mean $\bar{X}$ and its standard deviation
$\sigma$ are given. Different normal curves, then, will have either
different means or different standard deviations, or both. If the
means are different, the curves will have different positions on the
X-axis; if the standard deviations are different, the curves will be
of different widths. This is illustrated in Figs. 92a and 92b.

Although normal curves may thus differ in respect to their
means and standard deviations, they all possess the essential
"normal" form. This may be brought out by measuring the
attribute $X$, not in original units, such as pounds or dollars, but
as a deviation from the mean of $X$ measured in standard-deviation
units, i.e., as $\frac{X - \bar{X}}{\sigma}$ or $\frac{x}{\sigma}$.
Whenever this is done, all normal curves become one and the same
curve.

¹ It will be noted that
$e^{-\frac{(X-\bar{X})^2}{2\sigma^2}} = \frac{1}{e^{\frac{(X-\bar{X})^2}{2\sigma^2}}}$.
When $X = \bar{X}$, this becomes $1/e^0 = 1/1 = 1$. As $X$ moves away
from $\bar{X}$ in either direction, $e^{\frac{(X-\bar{X})^2}{2\sigma^2}}$
becomes larger and larger and hence
$\frac{1}{e^{\frac{(X-\bar{X})^2}{2\sigma^2}}}$ becomes smaller and
smaller. All the ordinates are positive since the exponent of $e$ is
squared.

Fig. 92b. Graphs of normal frequency curves with same mean, but
different standard deviations.

Suppose, for example, that $X$ in one case represents pounds
and in another quarts. In both cases the probability or relative
frequency is "normally" distributed, but the mean of the first
distribution is 4 pounds and its standard deviation is 1.5 pounds.
This distribution is represented by curve A in Fig. 92a. The
second distribution has a mean of 6 quarts and a standard
deviation of 0.5 quart; it is represented by curve D in Fig. 92b.
If, however, the unit adopted for the measurement of the attribute
of an element is in the first case
$\frac{X - 4\ \text{lb.}}{1.5\ \text{lb.}}$ and in the second case
$\frac{X - 6\ \text{qt.}}{0.5\ \text{qt.}}$, then in terms of these
$\frac{X - \bar{X}}{\sigma} = \frac{x}{\sigma}$ units the two
distributions will be identical. It is thus possible to reduce all
normal distributions to a standard normal form¹ (see Fig. 93).

Consider now the relationship between the mathematical
formula for the curve and this standard normal form. In the
individual or nonstandard form, the formula for the curve is
that given above, viz.,

$$dP = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(X-\bar{X})^2}{2\sigma^2}}\; dX$$

This says that the infinitesimal portion of the total area under the
curve $dP$ cut off by the infinitesimal class interval $X$ to
$X + dX$ is approximately equal to the area of an infinitesimal
rectangle whose height is
$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\bar{X})^2}{2\sigma^2}}$ and
whose base is $dX$. If now the attribute is expressed in
$\frac{X - \bar{X}}{\sigma} = \frac{x}{\sigma}$ units, the mathematical
formula becomes

$$dP = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}\; d\!\left(\frac{x}{\sigma}\right)$$

In this case, the infinitesimal portion of the total area cut off
by the infinitesimal class interval $\frac{x}{\sigma}$ to
$\frac{x}{\sigma} + d\!\left(\frac{x}{\sigma}\right)$ is expressed as
the area of a rectangle whose height is
$\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}$
and whose base is $d(x/\sigma)$. In other words, the effect of
measuring the attribute in $x/\sigma$ units instead of $X$ units is to
change the size of the unit class interval on which the rectangle
rests but at the same time to change its height proportionately so
its area $dP$ remains the same.

¹ It will be noted that in the standard form the height of the middle
ordinate is $1/\sqrt{2\pi} = 0.3989$.

Fig. 93. Graph of a standard normal curve.


A table giving the value of
$\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}$
for various values of $x/\sigma$ will be found in the Appendix,
Table VII. These are the ordinates of the standard normal curve. It
was by means of this table that Fig. 93 was plotted; its use in
fitting the normal curve to an actual set of data will be explained
in the next chapter.¹

¹ See pp. 277 and 295-304.
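
As a hedged illustration of what a table of standard ordinates contains, the short Python sketch below computes a few ordinates directly from the formula and then checks that changing from X units to x/σ units leaves the area of the small rectangle unchanged. The function name, the chosen values of x/σ, and the sample mean, standard deviation, and class interval are assumptions made for the example; the sketch does not reproduce the Appendix table.

```python
import math

def standard_ordinate(z):
    """Ordinate of the standard normal curve at z = x/sigma."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# A few ordinates of the kind such a table lists.
for z in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"x/sigma = {z:.1f}   ordinate = {standard_ordinate(z):.4f}")

# Rectangle in original units: height is the standard ordinate divided
# by sigma, base is dX.
mean, sd = 6.0, 0.5        # illustrative values (assumed)
X, dX = 6.3, 0.001         # a small class interval in original units
z = (X - mean) / sd
area_original = (standard_ordinate(z) / sd) * dX

# Rectangle in x/sigma units: height is the standard ordinate, base is
# d(x/sigma) = dX / sigma.  The two areas agree.
area_standard = standard_ordinate(z) * (dX / sd)
```

The base shrinks or stretches by the factor 1/σ while the height changes by the factor σ, so the product is the same in either system of units.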
PROBABILITY CALCULUS

Two Fundamental Theorems. There are two fundamental
theorems in the calculus of probabilities. These are the addition
theorem and the multiplication theorem. There is also a special
form of the latter that pertains only to "independent" attributes.

Addition Theorem. The addition theorem pertains to the
summation of the probabilities of one and the same probability
set. Since the several attributes of a given probability set are
necessarily mutually exclusive, it follows that the relative
frequency of cases having either one of two attributes is the sum
of the relative frequencies of the separate attributes. For
example, there are 13 hearts in an ordinary deck of 52 playing
cards, and also 13 diamonds. The relative frequency of a heart
is therefore 13/52, and the relative frequency of a diamond is also
13/52. The relative frequency of a red card, i.e., either a heart or
a diamond, is 26/52, which equals 13/52 + 13/52. Since the relative
frequencies of the attributes are by definition their probabilities,
it follows that the probability of a member of a set having either
one of several attributes is simply the sum of the individual
probabilities. The probability of a heart or a diamond is thus
13/52 + 13/52 = 26/52 = 1/2. This addition theorem is valid for
infinite probability sets as well as for finite sets.

Algebraically the addition theorem may be expressed as
follows: If the attributes of a given set are $X_1, X_2, \ldots, X_s$
(representing either qualitative or quantitative characteristics)
and their probabilities are $p_1, p_2, \ldots, p_s$, then the
probability of $X_1$, $X_2$, or $X_3$, say, that is, the attribute of
being any one of these X's, is simply $p_1 + p_2 + p_3$. If the
variation in attributes within the set is continuous and if the
distribution of probability is described by a formula such as
$dP = \varphi(X)\,dX$, then the probability of an attribute within any
one of a number of small ranges $dX$ whose sum constitutes the range
$X_1$ to $X_2$ is given by $\sum dP = \sum \varphi(X)\,dX$, or, in the
symbolism of the integral calculus,
$\int_{X_1}^{X_2} dP = \int_{X_1}^{X_2} \varphi(X)\,dX$. The significance
of this theorem will become clearer when its application to
particular problems is considered.
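
Both forms of the addition theorem can be sketched in a few lines of Python. This is an illustrative sketch only: exact fractions are used for the card example, and for the continuous case the standard normal ordinate is assumed as the function φ, with the sum of many small rectangles standing in for the integral. All names are hypothetical.

```python
from fractions import Fraction
import math

# Discrete case: one probability set (a 52-card deck) whose attributes
# are the four suits, each with relative frequency 13/52.
suit_prob = {suit: Fraction(13, 52)
             for suit in ("hearts", "diamonds", "clubs", "spades")}
p_red = suit_prob["hearts"] + suit_prob["diamonds"]
print(p_red)   # 1/2

# Continuous case: add up phi(X) * dX over many small ranges dX that
# together make up the range X1 to X2.
def phi(x):
    """Assumed example of phi: the standard normal ordinate."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def summed_probability(x1, x2, steps=10_000):
    dx = (x2 - x1) / steps
    # Each term is the area of one small rectangle, taken at its midpoint.
    return sum(phi(x1 + (i + 0.5) * dx) * dx for i in range(steps))

print(round(summed_probability(0.0, 0.2), 5))   # 0.07926
```

As the number of steps grows, the sum of rectangles approaches the value of the integral, which is the sense in which the summation sign passes into the integral sign.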

The addition theorem is sometimes stated as follows: The
probability of either one of two mutually exclusive events
(attributes) is the sum of their individual probabilities. This
version of the theorem is perfectly valid if it is understood that
the two attributes belong to the same set. It will be recalled that
all the attributes of a set are mutually exclusive.¹ It is not true,
however, that all mutually exclusive attributes belong to the same
set. To illustrate this point von Mises gives the following
example:²

Suppose that the probability of a man dying between his 
fortieth and forty-first birthdays is 0.011 and the probability of 
his marrying between his forty-first and forty-second birthdays 
is 0.009. These events are mutually exclusive, but it cannot 
be said that the probability of a man either dying in his fortieth 
year or marrying in his forty-first year is 0.011 + 0.009 = 0.02. 
The two events do not belong to the same set, and the addition 
theorem can be validly applied only to attributes of one and the 
same set. 

Multiplication Theorem. The multiplication theorem pertains
to the calculation of a probability of a "derived" or
"second-order" probability set from the probabilities of two or more
"first-order" sets. Consider, for example, an ordinary deck of
playing cards and a pinochle deck. Each may be said to constitute
a "first-order" probability set. In the first set the probability
of an ace is 4/52 = 1/13, and in the second the probability of
an ace is 8/48 = 1/6. Let the third set be formed from these two
first-order sets by combining each card of one first-order set with
each card of the other first-order set. Furthermore, let the
attribute of any pair be the values of the cards making up the
pair, such as a king and a nine, an ace and a queen, a two and a
ten, and the like. The probability of any one attribute in this
second-order set, say the probability of a pair consisting of two
aces, is the relative frequency of such pairs among all possible
pairs that might be formed. It is the purpose of the multiplication
theorem to give a general rule by which a probability of a
second-order set of this kind can be computed from the
probabilities of the first-order sets.

¹ See p. 252.

² Probability, Statistics, and Truth, p. 54.

In deriving the multiplication theorem, two cases must be
distinguished, one pertaining to independent probabilities and
the other to dependent probabilities. Consider first the case of
independent probabilities. Suppose that in finding all the various
pairs of cards that might be composed of one card from each
deck there is no limitation on how the cards might be matched.
Then each card of the ordinary deck will have associated with it
every card of the pinochle deck. Hence the set of pinochle cards
that are paired with the ace of spades, say, from the ordinary
deck will be the same as the set of pinochle cards paired with the
jack of diamonds, say, or in fact with any other card from the
ordinary deck.

Since the set of pinochle cards paired with each card of the
ordinary deck is thus the same as the set of pinochle cards paired
with every other card from the ordinary deck and since this
common set is identical with the original pinochle set, it follows
that the probability of any given pinochle card in the set paired
with any given card of the ordinary deck is the same as the
probability of that pinochle card in the original pinochle deck.
This being the case, the probability of a card from the pinochle
deck is said to be independent of the attribute of the card from
the ordinary deck.

In general, the probabilities of set II are said to be independent
of the attributes of probability set I if, when the members of the
two sets are paired together, the probability of an attribute in
the subset of members of set II paired with any given member of
set I is the same as the probability of that attribute in the
original set II. Symbolically, $P(B)$ is independent of $A$ if
$P(B/A)$, that is, the probability of B given A, equals $P(B)$, that
is, the probability of B.

Consider once again the given example. Since each card of the
ordinary deck is associated with each card of the pinochle deck,
the total number of different¹ pairs of cards that can be formed is
52 × 48, for there are 52 ways in which the first card can be
picked and 48 ways in which the second card can be picked.
Likewise, the number of combinations that would consist of two
aces would be 4 × 8 = 32. Hence the probability of a pair of
aces in the whole set of pairs of cards is 32/2,496 = 1/78. But
from the calculations this is seen to be equal to 1/13 × 1/6 = 1/78.
In other words, the probability of a pair of aces in the
second-order probability set is equal to the probability of an ace
in the ordinary deck times the probability of an ace in the pinochle
deck.

¹ Different in the sense that the cards going to make up any pair are
not precisely the same cards as those making up any other pair. This
means that the two aces of spades, the two kings of spades, the two
jacks of diamonds, etc., in the pinochle deck must be considered as
different cards, even though their value and suit are the same.
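
The counting argument for the independent case can be reproduced by listing every pair. The sketch below is illustrative only; the card labels and variable names are assumptions, and the two identical copies of each pinochle card are distinguished by a copy number so that they count as different cards.

```python
from fractions import Fraction
from itertools import product

suits = ("spades", "hearts", "diamonds", "clubs")
ordinary_ranks = ("A", "K", "Q", "J", "10", "9", "8",
                  "7", "6", "5", "4", "3", "2")
pinochle_ranks = ("A", "K", "Q", "J", "10", "9")

# Ordinary deck: 52 distinct cards.
ordinary = [(rank, suit) for rank in ordinary_ranks for suit in suits]
# Pinochle deck: 48 cards, two copies of each rank in each suit.
pinochle = [(rank, suit, copy)
            for rank in pinochle_ranks for suit in suits for copy in (1, 2)]

# Second-order set: every ordinary card paired with every pinochle card.
pairs = list(product(ordinary, pinochle))
ace_pairs = [(o, p) for o, p in pairs if o[0] == "A" and p[0] == "A"]

p_two_aces = Fraction(len(ace_pairs), len(pairs))
print(len(pairs), len(ace_pairs), p_two_aces)    # 2496 32 1/78
print(Fraction(4, 52) * Fraction(8, 48))         # 1/78
```

The relative frequency found by direct counting agrees with the product of the two first-order probabilities.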

The multiplication theorem for independent probabilities may
thus be stated as follows: If $p_a$ is the probability of a member
of set I having the attribute A and if $q_b$ is the probability of a
member of set II having the attribute B, and if $q_b$ is independent
of the attribute assumed by a member of set I, then the probability
of a member of the derived set I-II having the attribute AB, that
is, the probability of a pair AB among all possible pairs formed by
taking one element from each of the two sets I and II, is the
product of the probabilities $p_a$ and $q_b$. In simpler form, if
$P(B)$ is independent of A, the joint probability of A and B is equal
to the probability of A times the probability of B, that is,
$P(AB) = P(A) \cdot P(B)$.

Consider next the case of dependent probabilities. Suppose
that, in picking pairs of cards from the two decks, the following
modification is introduced. Suppose that, every time the card
picked from the ordinary playing-card deck is an ace, a king, a
queen, or a jack, and before any selection is made from the
pinochle deck, an ace, a king, and a queen of each suit are
discarded from the pinochle deck and are replaced by a jack, a ten,
and a nine of each suit. The pinochle deck would then contain the
same number of cards as before, but it would have 4 aces, kings,
and queens instead of 8, and 12 jacks, tens, and nines instead of 8.
After the pinochle deck has been modified in this way, let a card
be selected from it and combined with the ace, king, queen, or
jack picked from the ordinary deck. If the card picked from the
ordinary deck is not an ace, a king, a queen, or a jack, let no
modification be made in the pinochle deck. The effect of this
modification in the method of forming pairs of cards is to make
the attributes of the second set dependent on those of the first.
For the set of cards of the pinochle deck associated with an ace,
a king, a queen, or a jack of the ordinary deck is different from
the set of cards of the pinochle deck associated with other cards
of the ordinary deck. The probability of a given type of card
from the pinochle pack now depends on the card from the
ordinary deck; no longer does $P(B/A) = P(B)$.

The number of different pairs of cards that can be made in the
way just described is 52 × 48 as before, for the first card can still
be selected in 52 ways and the second in 48 ways. The number
of pairs of cards consisting of two aces, however, is now 4 × 4
instead of 4 × 8 as in the previous example. Hence the probability
of a pair of aces among the whole set of pairs is

$$\frac{4 \times 4}{52 \times 48} = \frac{1}{156}$$

It will be noted, however, that this probability of a pair of aces
is the product of the probability of an ace in the ordinary deck
(that is, 4/52) times the probability of an ace in the modified
pinochle deck (that is, 4/48). In other words, the probability of a
pair of aces is the probability of an ace in the ordinary deck
times the probability of an ace in the pinochle deck given the
selection of an ace from the ordinary deck.

The multiplication rule for dependent attributes may thus be
stated as follows: If members of probability set II are paired
with members of probability set I in such a way that the
probability of an attribute in the subset of members of set II
paired with any given member of set I varies with the attribute
of that member of set I, then the probability of the pair AB in
the whole set of paired values is equal to the product of the
probability of A in set I times the probability of B in that subset
of set II associated with the given attribute A. Symbolically,
the multiplication rule for dependent attributes is

$$P(AB) = P(A) \cdot P(B/A)$$

that is, the probability of AB equals the probability of A times
the probability of B given A. Since, when $P(B)$ is independent
of A, $P(B/A) = P(B)$, the multiplication rule for independent
probabilities is a special case of the general formula

$$P(AB) = P(A) \cdot P(B/A)$$

The multiplication theorem for both dependent and independent
probabilities is valid for infinite sets as well as finite sets.
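
The dependent scheme can be checked in the same counting spirit. The following Python sketch is illustrative and its names are hypothetical: each deck is summarized by the number of cards of each rank, and the pinochle deck used for a pair depends on the rank drawn from the ordinary deck, as in the modification described above.

```python
from fractions import Fraction

ORDINARY = {rank: 4 for rank in ("A", "K", "Q", "J", "10", "9", "8",
                                 "7", "6", "5", "4", "3", "2")}
PINOCHLE = {rank: 8 for rank in ("A", "K", "Q", "J", "10", "9")}
# After one ace, king, and queen of each suit are replaced by a jack,
# a ten, and a nine of each suit.
MODIFIED = {"A": 4, "K": 4, "Q": 4, "J": 12, "10": 12, "9": 12}

def pinochle_for(ordinary_rank):
    """Pinochle deck that is paired with a given ordinary-deck rank."""
    if ordinary_rank in ("A", "K", "Q", "J"):
        return MODIFIED
    return PINOCHLE

def joint_probability(a, b):
    """Relative frequency of the pair (a, b) among all pairs formed."""
    total = sum(count * sum(pinochle_for(rank).values())
                for rank, count in ORDINARY.items())
    favourable = ORDINARY[a] * pinochle_for(a).get(b, 0)
    return Fraction(favourable, total)

p_a = Fraction(ORDINARY["A"], sum(ORDINARY.values()))                 # P(A)
p_b_given_a = Fraction(MODIFIED["A"], sum(MODIFIED.values()))         # P(B/A)

print(joint_probability("A", "A"))   # 1/156
print(p_a * p_b_given_a)             # 1/156
```

For an ordinary-deck rank that leaves the pinochle deck unmodified, such as a nine, the same function reduces to the independent rule, since the conditional probability is then the same as the unconditional one.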
The significance of independence and dependence may be
illustrated by a case pertaining to real life. Suppose that one
probability set consists of American fathers of the white race
and the other probability set consists of their eldest sons.
Suppose the attributes distinguished in each set are dying from
cancer, dying from heart disease, dying from tuberculosis, and
dying from other causes. If, now, the probability of a son dying
from tuberculosis, say, is greater among those sons whose fathers
died of tuberculosis than among the whole group of sons, then
the probability of a son dying from a certain cause is not
independent of the cause of death of his father. If, on the other
hand, the probability of a son dying from any particular cause
is the same for the sons whose fathers died from cancer, the
sons whose fathers died from heart disease, the sons whose fathers
died from tuberculosis, and the sons whose fathers died from
other causes, i.e., if the probability of a son dying from any
particular cause is the same, whatever the cause of the father's
death, then the probability of a son's death is independent of the
cause of death of his father. For example, a case of dependence
would be the following:


                                     Probability of death of eldest son from

Eldest sons whose                             Heart     Tubercu-    Other
fathers died of                    Cancer    disease     losis      causes

Cancer                              0.310     0.102      0.030      0.558
Heart disease                       0.218     0.151      0.041      0.590
Tuberculosis                        0.220     0.118      0.093      0.569
Other causes                        0.215     0.112      0.042      0.631

All eldest sons                     0.228     0.120      0.046      0.606



A case of independence would be that in which the figures
in every row of any given column were the same and these in turn
were the same as the figures for "All eldest sons." If the
probability of a father dying from cancer was 0.228, from heart
disease 0.120, from tuberculosis 0.046, and from other causes 0.606,
then, in the case of dependence represented by the above table,
the probability of both a father and an eldest son dying from
heart disease would be (0.120)(0.151) = 0.018. In the case of
independence, on the other hand, the probability of both a
father and an eldest son dying of heart disease would be

(0.120)(0.120) = 0.014
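
The two joint probabilities can be recomputed from the tabulated figures. In this illustrative Python sketch the dictionary layout and names are assumptions; the numbers are those given in the text.

```python
causes = ("cancer", "heart disease", "tuberculosis", "other causes")

# P(father dies of each cause), as stated in the text.
p_father = dict(zip(causes, (0.228, 0.120, 0.046, 0.606)))

# P(son's cause / father's cause): one row per cause of the father's death.
p_son_given_father = {
    "cancer":        dict(zip(causes, (0.310, 0.102, 0.030, 0.558))),
    "heart disease": dict(zip(causes, (0.218, 0.151, 0.041, 0.590))),
    "tuberculosis":  dict(zip(causes, (0.220, 0.118, 0.093, 0.569))),
    "other causes":  dict(zip(causes, (0.215, 0.112, 0.042, 0.631))),
}
# P(son's cause) among all eldest sons.
p_son_all = dict(zip(causes, (0.228, 0.120, 0.046, 0.606)))

hd = "heart disease"
# Dependent case: P(AB) = P(A) * P(B/A)
joint_dependent = p_father[hd] * p_son_given_father[hd][hd]
# Independent case: P(AB) = P(A) * P(B)
joint_independent = p_father[hd] * p_son_all[hd]

print(round(joint_dependent, 3))     # 0.018
print(round(joint_independent, 3))   # 0.014
```

Each row of conditional probabilities sums to 1, as it must, because every son dies of exactly one of the four listed causes.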

Illustrations. The following examples will help to illustrate
the use of the addition and multiplication theorems in the
calculation of desired probabilities. Some will also serve to
illustrate the use of the normal probability curve.

Examples Involving Discrete Distributions. Suppose that a
gambling game consists of the random tossing of five coins.
You agree to pay your opponent a predetermined sum of money
whenever all five coins turn up heads; he agrees to pay you a
predetermined sum whenever any other result occurs. The question
is: What should the odds be to make the game a fair one?
The answer is obtained as follows:

Assume that the character of the coins and the method of
tossing are such as to cause each coin to tend to turn up heads
and tails in equal proportion. In a large number of tossings,
therefore, the probability of a head on each coin may be taken
as 1/2. Assume, also, that the method of tossing is such as to
make the tosses of each coin independent of the others. Then
the probability of heads on all five coins is

$$\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right) = \left(\tfrac{1}{2}\right)^5 = \tfrac{1}{32}$$

Accordingly, the fair odds are 31 to 1. That is, the game will 
be fair if you agree to pay your opponent $31 every time five 
heads occur and he agrees to pay you $1 every time some other 
result occurs. Of course, in an actual game the assumption 
regarding the character of the coins and the method of tossing 
would have to be checked by examination of the coins and by 
trial tossings. This is an illustration of the multiplication 
theorem for independent probabilities. 
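
The step from a probability to fair odds can be written out explicitly. The sketch below is an illustration under the usual definition of a fair game (the expected gain of each player is zero); the helper name `fair_odds` is hypothetical.

```python
from fractions import Fraction

def fair_odds(p):
    """Return (pay, receive): the sum paid when the event occurs and the
    sum received otherwise, in lowest terms, for a game of zero expected
    gain.  Fairness requires p * pay == (1 - p) * receive."""
    ratio = (1 - p) / p
    return ratio.numerator, ratio.denominator

# Five independent coins, each with probability 1/2 of heads.
p_five_heads = Fraction(1, 2) ** 5
pay, receive = fair_odds(p_five_heads)
print(p_five_heads, pay, receive)    # 1/32 31 1

# Expected gain per game for the player who pays on five heads.
expected_gain = -pay * p_five_heads + receive * (1 - p_five_heads)
print(expected_gain)                 # 0
```

The same helper applies to any event whose probability is known as an exact fraction.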

Another gambling game consists of the throwing of two dice.
You agree to pay your opponent a predetermined sum whenever
a combination totaling 7 appears, and he agrees to pay
you a predetermined sum whenever another result appears.
Again the problem is to determine fair odds. This may be done
by a combined application of the multiplication and addition
theorems.

Assume again that the dice are of such a character and are so
thrown that all faces tend to turn up in equal proportions. The
probability of any given result for each die is therefore 1/6. The
six possible combinations that add up to 7 are (1,6), (2,5), (3,4),
(4,3), (5,2), and (6,1). If it is assumed that the dice are thrown
so as to give independent results, then, by the multiplication
theorem for independent probabilities, the probability of each
one of the above combinations is (1/6)(1/6) = 1/36. Any one of the
combinations, however, will yield a total of 7. The probability
of a total of 7 is therefore the probability of any one of these
combinations, which, by the addition theorem, is

$$\tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} = \tfrac{6}{36} = \tfrac{1}{6}$$

Hence, fair odds are 5 to 1; that is, the game will be fair if you
pay your opponent $5 every time a 7 occurs and he pays you $1
every time some other total occurs. Again, in a real game the
character of the dice and the method of throwing should be
checked to see if the above assumptions are warranted.
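
The combined use of the two theorems in the dice game can be traced by enumeration. This Python sketch is illustrative only and assumes, as the text does, six equally likely faces and independent dice.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
outcomes = list(product(faces, repeat=2))        # 36 ordered results
sevens = [pair for pair in outcomes if sum(pair) == 7]

# Multiplication theorem: probability of each particular combination.
p_each = Fraction(1, 6) * Fraction(1, 6)
# Addition theorem: the combinations are mutually exclusive members of
# one set, so their probabilities are summed.
p_seven = sum(p_each for _ in sevens)

print(sevens)      # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
print(p_seven)     # 1/6
print((1 - p_seven) / p_seven)   # 5, i.e. fair odds of 5 to 1
```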

Consider still a third gambling game. Suppose that two cards
are drawn at random from a well-shuffled pack, the suit is noted,
the cards are returned to the pack, the latter is shuffled, and
the whole operation is repeated. Each time the cards are both
spades you agree to pay your opponent a predetermined sum;
if they are otherwise, he pays you. What are fair odds?

Assume as in the other games that the method of drawing
cards and the method of shuffling are such that all cards tend
to be drawn in equal proportion. The probability of a spade
among the first cards drawn will thus be 13/52, assuming the usual
deck of 13 spades, 13 hearts, 13 diamonds, and 13 clubs. If
the first card drawn is a spade, the remainder of the deck contains
12 spades and 13 of each of the other suits. If in a large
set of drawings each of these remaining cards tends to turn up
in the same proportion as every other card, then the probability
of a spade among the second cards drawn will be 12/51. This is
the probability of a spade on the second draw, assuming a spade
on the first draw. Then, according to the multiplication theorem
for dependent probabilities, the probability of a spade on both
the first and second draws is (13/52)(12/51) = 3/51. The odds will
be fair, therefore, if you pay $48 every time two spades are drawn
and your opponent pays $3 every time any other combination
is drawn. Again the assumptions would have to be checked in a
real game.
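
The card game can be verified both by the multiplication theorem for dependent probabilities and by counting every ordered pair of distinct cards. The sketch below is illustrative; the deck representation is an assumption.

```python
from fractions import Fraction
from itertools import permutations

suits = ("spades", "hearts", "diamonds", "clubs")
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]   # 52 cards

# Every ordered way of drawing two different cards without replacement.
draws = list(permutations(deck, 2))
both_spades = [d for d in draws
               if d[0][1] == "spades" and d[1][1] == "spades"]

p_enumerated = Fraction(len(both_spades), len(draws))
# P(AB) = P(A) * P(B/A)
p_by_theorem = Fraction(13, 52) * Fraction(12, 51)

print(len(draws), len(both_spades))   # 2652 156
print(p_enumerated, p_by_theorem)     # 1/17 1/17

# Pay $48 on two spades, receive $3 otherwise: expected gain is zero.
expected_gain = -48 * p_by_theorem + 3 * (1 - p_by_theorem)
print(expected_gain)                  # 0
```

The fraction 3/51 in the text reduces to 1/17, so stakes of $48 against $3 are the same as odds of 16 to 1.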

Examples Involving Continuous Distributions. All the examples
so far have been concerned with discrete probability distributions.
Much of the practical work in statistics, however, is
concerned with continuous distributions. As the first example of
this kind, consider the distribution of heights of eighteen-year-old
boys. The fitting of a normal curve to the heights of 300
eighteen-year-old Princeton freshmen¹ suggests that in general
the forces of nature are such as to cause a "normal" distribution
of heights. If this is assumed to be the case, then the normal
curve can be employed to calculate the probability of an
eighteen-year-old boy having a height lying within any given range.
This is done as follows:

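As indicated above,² if the distribution of probability follows
the normal law, then the probability of an attribute ranging from
$x/\sigma$ to $x/\sigma + d(x/\sigma)$ is given by the formula

$$dP = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}\; d\!\left(\frac{x}{\sigma}\right)$$

where $x/\sigma$ represents a deviation of the attribute from the mean
attribute measured in $\sigma$ units, $\sigma$ is the standard deviation
of the distribution, and $d(x/\sigma)$ is an infinitesimally small
range. This represents approximately the area under the curve for the
infinitesimal range $x/\sigma$ to $x/\sigma + d(x/\sigma)$. A finite
range running, say, from $x_1/\sigma$ to $x_2/\sigma$, can be conceived
as made up of a number of infinitesimal ranges of size $d(x/\sigma)$,
and the probability of an attribute ranging from $x_1/\sigma$ to
$x_2/\sigma$ is (by the addition theorem) merely the sum of the
probabilities for each of these infinitesimal ranges, viz.,

$$\sum_{x_1/\sigma}^{x_2/\sigma} dP = \sum_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}\; d\!\left(\frac{x}{\sigma}\right)$$

or, in the notation of the integral calculus,

$$\int_{x_1/\sigma}^{x_2/\sigma} dP = \int_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}\; d\!\left(\frac{x}{\sigma}\right)$$

¹ See pp. 296-306 and especially Fig. 101.

² See pp. 264-267.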
In other words, the probability of an attribute ranging from
$x_1/\sigma$ to $x_2/\sigma$ is simply the area under the curve for this
range. This is graphically shown in Fig. 94 and is a direct result of
the addition theorem.

The area under the normal curve for any given range might, as
indicated, be found by evaluating the integral

$$\int_{x_1/\sigma}^{x_2/\sigma} \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}\; d\!\left(\frac{x}{\sigma}\right)$$

This is not an easy task, however, even for those who understand
advanced mathematics. Consequently, tables have been prepared
that give the approximate areas under the normal curve
for certain specified ranges and that permit by simple arithmetical
operations the calculation of areas for all other ranges. Such a
table is Table VI of the Appendix, page 693. This gives the
proportionate area under the positive half of the normal curve
from the mean ($x/\sigma = 0$) to various selected points. Thus from
the table it is seen that the proportion of the area lying under the
normal curve from $x/\sigma = 0$ to $x/\sigma = 0.2$ is 0.07926.

Fig. 94. Illustration of computation of probability of an $x/\sigma$
lying between $x_1/\sigma$ and $x_2/\sigma$.

In addition, since the proportion of the area under the normal
curve from $x/\sigma = 0$ (the mean) to infinity is 0.50000, the
proportion of area under the curve from any selected point to infinity
can be readily calculated. Thus the proportion of the area under
the curve from $x/\sigma = 0.2$ to infinity is 0.42074 (i.e.,
0.50000 - 0.07926), the proportion of area from $x/\sigma = 1.96$ to
infinity is 0.02500 (i.e., 0.50000 - 0.47500), etc. Owing to the
symmetry of the curve, the same values hold true for areas from
$x/\sigma = 0$ to $x/\sigma = -\infty$. Thus the proportion of the area
for the range from $x/\sigma = -0.2$ to $x/\sigma = -\infty$ is
0.50000 - 0.07926 = 0.42074.

To find proportionate areas for other ranges, it is necessary
merely to add or subtract proportionate areas given directly by
the table. Thus, the proportion of area for the range
$x/\sigma = 0.2$ to $x/\sigma = 0.3$ is the difference between the
proportionate area from $x/\sigma = 0.3$ to the mean, and the
proportionate area from $x/\sigma = 0.2$ to the mean, i.e.,
0.11791 - 0.07926 = 0.03865. Likewise, the proportionate area under
the curve for the range $x/\sigma = -0.2$ to $x/\sigma = +0.3$ is
simply the sum of the proportionate area from $x/\sigma = -0.2$ to
$x/\sigma = 0$, and the proportionate area from $x/\sigma = 0$ to
$x/\sigma = 0.3$, i.e., 0.07926 + 0.11791 = 0.19717.
Proportionate areas for points not given directly or
indirectly by the table may be obtained by interpolation; usually,
straight-line interpolation (i.e., the calculation of simple
proportionate differences) gives satisfactory results.
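
The table arithmetic described above can be mirrored in Python. This is a sketch under an assumption: instead of reading a printed table, the area from the mean to a given point is computed from the error function in the standard library, which gives the same five-place values. The function name `area_from_mean` is hypothetical.

```python
import math

def area_from_mean(z):
    """Proportionate area under the standard normal curve between the
    mean (x/sigma = 0) and x/sigma = z, taking z as a distance (>= 0)."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

# Area from the mean to a selected point.
print(round(area_from_mean(0.2), 5))                         # 0.07926

# Area from a selected point to infinity: subtract from 0.50000.
print(round(0.5 - area_from_mean(0.2), 5))                   # 0.42074
print(round(0.5 - area_from_mean(1.96), 5))                  # 0.025

# Two points on the same side of the mean: subtract.
print(round(area_from_mean(0.3) - area_from_mean(0.2), 5))   # 0.03865

# Two points on opposite sides of the mean: add.
print(round(area_from_mean(-0.2) + area_from_mean(0.3), 5))  # 0.19717
```

Because the curve is symmetrical, the function takes the absolute value of its argument, so the area for a point below the mean equals that for the corresponding point above it.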

To make use of Table VI in a given problem it is merely necessary
to convert the original measurements into deviations from
the mean expressed in $\sigma$ units, i.e., to convert original units
into $\sigma$ units. The mean height of eighteen-year-old boys, for
example (as estimated from the heights of eighteen-year-old Princeton
freshmen of the class of 1943), is 70.47 inches, and the standard
deviation of heights is 2.49 inches. Hence the probability of an
eighteen-year-old boy 72 to 73 inches tall is given by the area
under the normal curve from

$$\frac{x}{\sigma} = \frac{72 - 70.47}{2.49} = 0.61 \quad\text{to}\quad \frac{x}{\sigma} = \frac{73 - 70.47}{2.49} = 1.02$$

This, in accordance with the method outlined in the previous
paragraph for calculating such an area, is 0.11707. Similarly, the
probability of a boy taller than 74 inches is given by the area
under the normal curve from
$\frac{x}{\sigma} = \frac{74 - 70.47}{2.49} = 1.42$ to infinity,
which the table shows to be 0.50000 - 0.42220 = 0.07780.
Again, the probability that two boys picked at random should both be
taller than 74 inches is the product of the two individual
probabilities (the multiplication theorem for independent
probabilities) or

(0.07780)(0.07780) = 0.00605

Table VI thus readily facilitates the calculation of probabilities
whenever the primary distribution or distributions follow the
normal law.
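
The height calculations can be followed step by step in a short Python sketch. It is illustrative only: the areas are computed from the error function rather than read from Table VI, the deviations are rounded to two decimals as in the text, and the function names are hypothetical.

```python
import math

MEAN_HEIGHT = 70.47   # inches, as given in the text
SD_HEIGHT = 2.49      # inches, as given in the text

def to_sigma_units(height):
    """Deviation from the mean in sigma units, rounded to two decimals
    as it would be for entry into a printed table."""
    return round((height - MEAN_HEIGHT) / SD_HEIGHT, 2)

def area_from_mean(z):
    """Area under the standard normal curve from the mean to z (z >= 0)."""
    return 0.5 * math.erf(z / math.sqrt(2.0))

z72, z73, z74 = (to_sigma_units(h) for h in (72, 73, 74))
print(z72, z73, z74)                              # 0.61 1.02 1.42

# Probability of a height between 72 and 73 inches.
p_72_to_73 = area_from_mean(z73) - area_from_mean(z72)
print(round(p_72_to_73, 5))                       # 0.11707

# Probability of a height above 74 inches.
p_over_74 = 0.5 - area_from_mean(z74)
print(round(p_over_74, 4))                        # 0.0778

# Two boys picked independently, both taller than 74 inches.
p_both_over_74 = p_over_74 ** 2
print(round(p_both_over_74, 5))                   # 0.00605
```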



CHAPTER XI 


SYMMETRICAL BINOMIAL DISTRIBUTION AND THE 
NORMAL CURVE 

INTRODUCTION 

The preceding chapters have been concerned with probability
and the probability calculus. These were discussed for the
purpose of providing tools for subsequent analysis. In this
chapter the tools will be employed in developing a theoretical
explanation of the normal frequency curve. The line of attack
will be as follows.

The argument will begin with an abstract study of a simple 
problem in combinatorial analysis. The basic data will be 10 
coins, each of which has two sides. These sides will be marked 
with a head or a tail, and each coin will have one head and one 
tail. 

The combinatorial problem will be the determination of the
relative frequencies or probabilities of various types of
combinations in the whole set of combinations that might be made
from various arrangements of the given set of coins. Thus
the theoretical problem will be the determination of the relative
frequencies or probabilities of combinations having 0, 1, 2, . . . ,
10 heads in the whole set of combinations that might be
constructed from various arrangements of the 10 coins.

In the terminology of probability, this combinatorial problem
consists of the derivation of a certain second-order probability
set from the elementary probability set. To put this in another
way, the problem is to find the type of frequency or probability
distribution that results from the combination of certain
elementary frequency or probability distributions. Attention will in
particular center upon the form of the derived frequency or
probability distribution. Exact and approximate formulas will
be determined for this distribution.

The purely theoretical part of the theory of the normal curve 
will thus be a set of problems in the probability calculus. What 
is ultimately desired, however, is the explanation that this 
distribution affords of some of the frequency distributions that 
appear in real life, such as the frequency distributions of the 
heights of adult white males, the frequency distribution of 
samples from a given population, and the like. This explana¬ 
tion will be undertaken after the completion of the combinatorial 
analysis. 


SYMMETRICAL BINOMIAL DISTRIBUTION 

Derivation. As already suggested, the discussion of the theory 
of the normal frequency curve will begin with the analysis of a 
simple problem involving 10 coins. Each coin, it will be assumed, 
has two sides, one of which is a head, the other a tail. Since the 
probability of an object has been defined as its relative frequency 
in the set of objects to which it belongs, it may be said that for 
each coin the probability of a side being a head is 1/2 and the 
probability of its being a tail is also 1/2. The problem to be 
considered is this: If the 10 coins are combined in all possible 
ways, the selection of a head or a tail for any one coin being 
independent of the selection for other coins, what are the various 
types of combinations of heads and tails that will be produced 
and what will be the probability of each type in the set of all 
possible combinations? This is a straightforward problem in the 
theory of combinations and may be solved as follows. 

To facilitate the analysis, let the 10 coins be distinguished 
by the letters A, B, C, D, E, F, G, H, I, and J. A combination 
having 0 heads, for example, will be represented as follows, 

A B C D E F G H I J
T T T T T T T T T T

a combination having 4 heads as follows, 

A B C D E F G H I J
H T T H H T T T H T

etc. 

Consider first the combination having 0 heads. Since the 
probability of a tail on each coin is 1/2, the probability of 0 heads 
is $(1/2)^{10}$. For the probability of A being a tail is 1/2, and the same 
is true for B, C, D, E, F, G, H, I, and J. Furthermore, since the 
probability of a tail for any one coin is independent of what 
the other coins are, the probability of the above result is, by 
the multiplication theorem, the product of the 10 independent 
probabilities, or $(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2)(1/2) = (1/2)^{10}$. Finally, 
it is to be noted that this result can be obtained in only one way. 
Hence it is to be concluded that the probability of 0 heads is 
$1/2^{10} = 1/1,024$. 

Consider next the following combination: 

A B C D E F G H I J
H T T T T T T T T T

This is a case of 1 head. Since the probability of A being 
a head is 1/2 and the probability of each of the other coins being a 
tail is also 1/2 and since each of these results is independent of the 
others, it follows that the probability of this particular 
combination of heads and tails is again $(1/2)^{10}$. But there are also 
other combinations having only 1 head. Such are 

A B C D E F G H I J
T H T T T T T T T T
T T H T T T T T T T
. . .
T T T T T T T T T H

In fact, it is readily seen that there are 10 combinations 
altogether, in each of which a different coin is the one being a 
head. The probability, therefore, of any one of these 10 
combinations, i.e., the probability of a head on some one and only 
one of the 10 coins, is, by the addition theorem, $10(1/2)^{10} = 10/1,024$. 

Consider now the combination 

A B C D E F G H I J
H H T T T T T T T T

This is a case of 2 heads. Since the probability of A being a 
head is 1/2, the probability of B being a head is 1/2, and the 
probability of each of the other coins being a tail is likewise 1/2, and 
since each of these results is independent of all the others, it 
follows once more that the probability of this particular 
combination is $(1/2)^{10}$. 

But, as previously, this is not the only combination having 
2 heads. The reader himself will be able to write down a number 
of other combinations in which only 2 heads appear. The 
question is how many different combinations of the 10 coins 
have 2 and only 2 heads? This is answered by the theory of 
permutations and combinations outlined in Chap. X.¹ Thus 
the number of different combinations of 10 coins taken 2 at a 
time is 

$$\frac{10!}{2!\,8!} = 45$$

[Cf. Eq. (3), page 234.] There being, therefore, 45 different 
combinations, each of which has a probability of $(1/2)^{10}$, it follows 
that the probability of any one of them is 

$$45\left(\frac{1}{2}\right)^{10} = \frac{45}{1{,}024}$$



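The probability of other possible combinations is determined 
in a similar manner. In general, the probability of $N_1$ heads is 

$$\frac{10!}{N_1!\,(10 - N_1)!}\left(\frac{1}{2}\right)^{10}$$

Thus the probability of 3 heads and 7 tails is 

$$\frac{10!}{3!\,7!}\left(\frac{1}{2}\right)^{10} = 120\left(\frac{1}{2}\right)^{10} = \frac{120}{1{,}024}$$

The probability of 6 heads and 4 tails is 

$$\frac{10!}{4!\,6!}\left(\frac{1}{2}\right)^{10} = \frac{210}{1{,}024}$$

etc. The results obtained by use of this formula may be 
tabulated as follows: 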


Table 19.—Probabilities of Various Combinations among All Possible 
Combinations of 10 Coins 

Combinations Having      Probability
     0 heads               1/1,024 = 0.00098
     1 head               10/1,024 = 0.00977
     2 heads              45/1,024 = 0.04394
     3 heads             120/1,024 = 0.11719
     4 heads             210/1,024 = 0.20508
     5 heads             252/1,024 = 0.24609
     6 heads             210/1,024 = 0.20508
     7 heads             120/1,024 = 0.11719
     8 heads              45/1,024 = 0.04394
     9 heads              10/1,024 = 0.00977
    10 heads               1/1,024 = 0.00098
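
The entries of Table 19 can be regenerated mechanically. The sketch below assumes Python 3.8 or later (for `math.comb`); the variable names are illustrative only.

```python
from math import comb

N = 10                      # number of coins
total = 2 ** N              # number of equally likely arrangements

# Number of arrangements with k heads, for k = 0, 1, ..., N.
counts = [comb(N, k) for k in range(N + 1)]
probs = [c / total for c in counts]

for k, (c, p) in enumerate(zip(counts, probs)):
    print(f"{k:2d} heads  {c:3d}/{total:,} = {p:.5f}")
```

Because the program rounds rather than truncates, a printed probability may differ from the table by one unit in the fifth decimal place.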


1 See pp. 232-234. 

It will be noted that the series of probabilities of 0, 1, 2, . . . , 
10 heads may be obtained by the expansion of $(\tfrac{1}{2} + \tfrac{1}{2})^{10}$. This 
distribution of probability is consequently called a “binomial” 
distribution.¹ If $N$ coins had been used instead of 10, the 
probabilities of the distribution would have been given by the 
terms of the expansion of $(\tfrac{1}{2} + \tfrac{1}{2})^{N}$. Thus the probability of 
a combination having $N_1$ heads among all possible combinations 
of $N$ coins is² 

$$P(N_1) = \frac{N!}{N_1!\,(N - N_1)!}\left(\frac{1}{2}\right)^{N}$$

or, if $N_2$ is set equal to $N - N_1$, 

$$P(N_1) = \frac{N!}{N_1!\,N_2!}\left(\frac{1}{2}\right)^{N} \tag{1}$$

This is the general formula for a symmetrical binomial 
distribution. 

Character of the Symmetrical Binomial Distribution. A 
graph of the probabilities given in Table 19 is presented in Fig. 
95. It will be noted from the table and also from the figure that 
the probability of 0 heads is the same as the probability of 10 
heads, that the probability of 1 head is the same as the prob¬ 
ability of 9 heads, etc. In other words, the distribution of 
probabilities is symmetrical about a central point, in this case 
the point representing 5 heads. This symmetry is the reason 
for the name “symmetrical” binomial distribution. 

Mathematical analysis shows that in general the symmetrical 
binomial distribution has the following characteristics:³ 

$$\text{Mean} = \frac{N}{2}, \qquad \sigma = \sqrt{\frac{N}{4}}, \qquad \beta_1 = 0, \qquad \beta_2 = 3 - \frac{2}{N} \tag{2}$$

1 Cf. p. 234. 

2 Ibid. 

3 These formulas are derived in Smith and Duncan, Sampling Statistics, 
pp. 65-67. 

It will be sufficient to check these equations here by finding the 
mean, standard deviation, $\beta_1$, and $\beta_2$ of the distribution of 
Table 19. 



Fig. 95.—Graph of a symmetrical binomial distribution. 

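The mean of a distribution of probability, it will be recalled, 
is equal to the sum of the attributes times their probabilities.¹ 
The mean of the distribution of Table 19 is thus 

$$
\begin{aligned}
\text{Mean} = {} & 0\left(\frac{1}{1{,}024}\right) + 1\left(\frac{10}{1{,}024}\right) + 2\left(\frac{45}{1{,}024}\right) + 3\left(\frac{120}{1{,}024}\right) + 4\left(\frac{210}{1{,}024}\right) \\
& + 5\left(\frac{252}{1{,}024}\right) + 6\left(\frac{210}{1{,}024}\right) + 7\left(\frac{120}{1{,}024}\right) + 8\left(\frac{45}{1{,}024}\right) + 9\left(\frac{10}{1{,}024}\right) \\
& + 10\left(\frac{1}{1{,}024}\right) = 5
\end{aligned}
$$

According to the formula, the mean equals $N/2 = 10/2 = 5$, which 
is the same value as that derived above by direct calculation. 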

1 See p. 169. 

Similarly, the variance of a distribution of probability is equal 
to the sum of the deviations from the mean squared and 
multiplied by their probabilities. Hence, the variance of the 
distribution of Table 19 is 

$$
\begin{aligned}
\sigma^2 = {} & \frac{1}{1{,}024}(-5)^2 + \frac{10}{1{,}024}(-4)^2 + \frac{45}{1{,}024}(-3)^2 + \frac{120}{1{,}024}(-2)^2 \\
& + \frac{210}{1{,}024}(-1)^2 + \frac{252}{1{,}024}(0)^2 + \frac{210}{1{,}024}(1)^2 + \frac{120}{1{,}024}(2)^2 \\
& + \frac{45}{1{,}024}(3)^2 + \frac{10}{1{,}024}(4)^2 + \frac{1}{1{,}024}(5)^2 = 2.5
\end{aligned}
$$

This again checks with the formula, which gives $\sigma^2 = N/4 = 2.5$. 
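
These checks can also be run mechanically. The sketch below is an illustration rather than part of the original exposition: it assumes Python 3.8 or later, uses invented names, and evaluates the mean and variance of Table 19 together with the third and fourth moments about the mean that enter $\beta_1$ and $\beta_2$ in Eqs. (2).

```python
from math import comb

N = 10
probs = [comb(N, k) / 2 ** N for k in range(N + 1)]   # Table 19

# Mean: sum of the attributes (numbers of heads) times their probabilities.
mean = sum(k * p for k, p in enumerate(probs))


def central_moment(r):
    """Sum of r-th powers of deviations from the mean times probabilities."""
    return sum((k - mean) ** r * p for k, p in enumerate(probs))


mu2 = central_moment(2)     # variance
mu3 = central_moment(3)
mu4 = central_moment(4)
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2

print("mean", mean, "variance", mu2, "beta1", beta1, "beta2", beta2)
print("formulas:", N / 2, N / 4, 0, 3 - 2 / N)
```

Changing `N` to another value gives a quick check that the general formulas are not peculiar to 10 coins.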

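Likewise, the third moment about the mean of a probability 
distribution is the sum of the deviations from the mean cubed 
and multiplied by their probabilities, and the fourth moment is 
the sum of the deviations from the mean raised to the fourth 
power and multiplied by their probabilities. Thus, for the 
distribution of Table 19, 

$$
\begin{aligned}
\mu_3 = {} & \frac{1}{1{,}024}(-5)^3 + \frac{10}{1{,}024}(-4)^3 + \frac{45}{1{,}024}(-3)^3 + \frac{120}{1{,}024}(-2)^3 \\
& + \frac{210}{1{,}024}(-1)^3 + \frac{252}{1{,}024}(0)^3 + \frac{210}{1{,}024}(1)^3 + \frac{120}{1{,}024}(2)^3 \\
& + \frac{45}{1{,}024}(3)^3 + \frac{10}{1{,}024}(4)^3 + \frac{1}{1{,}024}(5)^3 = 0
\end{aligned}
$$

and 

$$
\begin{aligned}
\mu_4 = {} & \frac{1}{1{,}024}(-5)^4 + \frac{10}{1{,}024}(-4)^4 + \frac{45}{1{,}024}(-3)^4 + \frac{120}{1{,}024}(-2)^4 \\
& + \frac{210}{1{,}024}(-1)^4 + \frac{252}{1{,}024}(0)^4 + \frac{210}{1{,}024}(1)^4 + \frac{120}{1{,}024}(2)^4 \\
& + \frac{45}{1{,}024}(3)^4 + \frac{10}{1{,}024}(4)^4 + \frac{1}{1{,}024}(5)^4 = 17.5
\end{aligned}
$$

Since, by definition, $\beta_1 = \mu_3^2/\mu_2^3$ and $\beta_2 = \mu_4/\mu_2^2$, it follows that for 
this distribution $\beta_1$ is zero and $\beta_2 = 17.5/(2.5)^2 = 2.8$, which are the 
values again given by the general formulas. These formulas are 
valid for all symmetrical binomial distributions. 

The Normal Curve. If 40 instead of 10 coins were involved, 
the distribution of probability would be considerably more 
spread out than that of Table 19. This is readily seen from 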
Fig. 96. 

Fig. 96.—Graphs of two symmetrical binomial distributions, one for N = 10, 
the other for N = 40. 

In general, the formula $\sigma = \sqrt{N/4}$ indicates that the 
dispersion of the distribution increases in proportion to $\sqrt{N}$. 
If the horizontal scale is reduced, however, and the vertical scale 
enlarged, in the same proportion that the dispersion of the 
distribution is increased, then the effect of increasing $N$ is to bring 
the ordinates of the distribution closer together and to raise them 
to the height of the original distribution. Under these 
conditions the tops of the ordinates tend to sketch out a smooth curve 
as $N$ is increased. This is indicated in Fig. 97. It can be shown 
that the limit that the symmetrical binomial distribution 
approaches as $N$ is increased, while at the same time the scales 
are adjusted in proportion to $\sqrt{N}$, is the normal curve 

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}$$




That the symmetrical binomial distribution approaches the 
normal curve as a limit can be definitely proved by rigorous 
mathematical analysis.¹ Certain general considerations, however, 
suggest this same conclusion. 

1. The distributions of Figs. 96 and 97 have a shape similar to 
that of the normal curve; and if a normal curve with the same 
mean as any one of these distributions and the same standard 
deviation is graphed together with that distribution, the curve 
is seen to be a good fit. This is shown² in Fig. 98. 

Fig. 97.—Illustration of effect of scale adjustments on a symmetrical binomial 
distribution. 

1 This is demonstrated in Smith and Duncan, Sampling Statistics, pp. 68-74. 

2 The binomial distribution is a discrete distribution, and its probabilities 
are correctly represented by a series of ordinates as in Figs. 96 and 97. It 
is the ordinates of the normal curve of Fig. 98 at 1/σ, 2/σ, etc., and not 
sections of the curve area that approximate the binomial ordinates at these 
points. As pointed out, however, in Smith and Duncan, Sampling Statistics, 
p. 74, it is possible to represent any symmetrical binomial distribution by a 
histogram whose area is approximated by that of a normal curve. In this 
way the area tables of the normal curve can be used to approximate a series 
of binomial probabilities. 
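
The closeness of fit described in point 1, and the remark in the note that it is the ordinates of the normal curve that approximate the binomial ordinates, can be examined numerically. The following sketch assumes Python 3.8 or later; the function names are illustrative. It compares each binomial ordinate with the ordinate of a normal curve having the same mean $N/2$ and standard deviation $\sqrt{N/4}$.

```python
from math import comb, exp, pi, sqrt


def binomial_ordinates(n):
    """Probabilities of 0, 1, ..., n heads for n symmetrical coins."""
    return [comb(n, k) / 2 ** n for k in range(n + 1)]


def normal_ordinates(n):
    """Ordinates of the normal curve with mean n/2 and sigma = sqrt(n/4)."""
    mean, sigma = n / 2, sqrt(n / 4)
    return [exp(-((k - mean) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))
            for k in range(n + 1)]


def largest_gap(n):
    """Largest absolute difference between corresponding ordinates."""
    pairs = zip(binomial_ordinates(n), normal_ordinates(n))
    return max(abs(b - g) for b, g in pairs)


for n in (10, 40):
    print(n, round(largest_gap(n), 5))
```

The largest discrepancy is already small for 10 coins and is smaller still for 40, in keeping with the limiting argument of the text.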
Fig. 98.—Comparison of a symmetrical binomial distribution with the normal 
curve. 

Fig. 99.—Graph illustrating computation of relative slope of frequency polygon 
at a given point. 

2. Equations (2)* show that $\beta_1 = 0$ for the symmetrical 
binomial distribution and that $\beta_2$ approaches the value 3 as $N$ 
is increased. These are also the values of $\beta_1$ and $\beta_2$ for the 
normal curve. 

3. If a graph is made of the symmetrical binomial distribution 
in the form of a frequency polygon, the relative slope of any side 
of this polygon at its mid-point is the same as the relative slope 
of a normal curve at that point. Figure 99 shows, for example, 
that for $N = 10$ the ordinate of the symmetrical binomial at 
$N_1 = 6$ is equal to 210/1,024 and the ordinate at $N_1 = 7$ is 
120/1,024. The mid-point between 6 and 7 is 6.5, and the 
ordinate of the polygon at that point is 

$$\frac{1}{2}\left(\frac{210}{1{,}024} + \frac{120}{1{,}024}\right) = \frac{165}{1{,}024}$$

The absolute slope of the polygon at this mid-point is given by 
the ratio of the difference between the ordinates at 7 and 6 (that 
is, $\frac{120}{1{,}024} - \frac{210}{1{,}024} = -\frac{90}{1{,}024}$) to the distance between the abscissa 
points 6 and 7 (that is, 7 − 6 = 1); and the relative slope at the 
mid-point is given by the ratio of the absolute slope to the 
ordinate at that point. Thus, the relative slope of the polygon of 
Fig. 99 at the mid-point 6.5 is 

$$\frac{-\dfrac{90}{1{,}024}}{\dfrac{165}{1{,}024}} = -\frac{90}{165} = -0.545$$


In general, the relative slope of a symmetrical binomial 
distribution at any mid-point $N_1 + \tfrac{1}{2}$ is equal to¹ 

$$-\frac{2\left(N_1 + \tfrac{1}{2} - \tfrac{N}{2}\right)}{\tfrac{N + 1}{2}}$$

If $x$ is set equal to $N_1 + \tfrac{1}{2} - \tfrac{N}{2}$, the deviation of the 
mid-point from the mean, and if $2k^2$ is set equal to $\tfrac{N + 1}{2}$, 

* Page 283. 

1 See Smith and Duncan, Sampling Statistics, pp. 74-76. 

this expression for the relative slope at point $x$ becomes $-\dfrac{2x}{2k^2}$, or 
$-\dfrac{x}{k^2}$. But it can be shown¹ by the differential calculus that the 
relative slope of the normal curve at any point $x$ is $-\dfrac{x}{\sigma^2}$, where 
$x = X - \bar{X}$. Hence the relative slope of the symmetrical 
binomial distribution at any mid-point is the same as that of a 
normal curve whose standard deviation is equal to $k = \sqrt{\dfrac{N + 1}{4}}$, 
which, if $N$ is large, is practically the same as the standard 
deviation of the symmetrical binomial distribution.² 
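
The relative-slope argument can be verified for every side of the polygon. The sketch below assumes Python 3.8 or later, with illustrative function names; it computes the relative slope of the frequency polygon directly from adjacent ordinates and compares it with $-x/k^2$, where $k^2 = (N + 1)/4$.

```python
from math import comb


def polygon_relative_slope(n, n1):
    """Relative slope of the frequency polygon at the mid-point n1 + 1/2."""
    left = comb(n, n1) / 2 ** n           # ordinate at n1
    right = comb(n, n1 + 1) / 2 ** n      # ordinate at n1 + 1
    absolute_slope = (right - left) / 1   # abscissa distance is 1
    mid_ordinate = (left + right) / 2
    return absolute_slope / mid_ordinate


def normal_relative_slope(n, n1):
    """-x / k^2 with x the deviation of the mid-point from the mean."""
    x = n1 + 0.5 - n / 2
    k_squared = (n + 1) / 4
    return -x / k_squared


# The worked example of the text: N = 10, mid-point 6.5.
print(round(polygon_relative_slope(10, 6), 3))
print(round(normal_relative_slope(10, 6), 3))
```

For 10 coins both functions agree at all ten mid-points, which is the content of the general statement above.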


CONDITIONS PRODUCING THE SYMMETRICAL BINOMIAL AND 
THE NORMAL CURVE IN REAL LIFE 

The foregoing sections have been devoted to the derivation and 
description of a particular frequency distribution known as the 
binomial distribution. The analysis has consisted entirely of an 
application of the probability calculus, and the result is an 
abstract distribution of probability. Since the ultimate purpose 
of the analysis is an explanation of some of the frequency distri¬ 
butions of real life, it is desirable at this point to consider the 
question: What is the relationship between the symmetrical 
binomial distribution and a frequency distribution of real life? 

Consider first the following hypothetical experiment: Suppose 
that the 10 coins referred to in the theoretical discussion are 
tossed a large number of times and the relative frequencies with 
which they come up 0, 1, 2, ... , 10 heads are computed. 
What will be the results? Actually, no precise prediction can be 
made. Intuition suggests, however, that, if the coins are sym¬ 
metrical and are tossed in an unbiased fashion, the relative 
frequencies with which the combinations 0, 1, 2, . . . , 10 heads 
will appear will be close to the probabilities of these combinations 
among the whole set of combinations that could be made from 
10 coins. For if coins are tossed at random, it is to be expected 
that a head will appear on any one coin as often as a tail. The 
randomness also ensures that the appearance of a head on one 
coin will be independent of the appearance of a head or a tail on 
any other coin. Under these conditions it would seem likely 

1 Ibid. 

2 See Eqs. (2), p. 283. 

that any particular arrangement of heads and tails would occur 
just as often as any other arrangement. Therefore, the relative 
number of times 3 heads and 7 tails would appear, for example, 
would be equal to the relative number of arrangements that 
would yield 3 heads and 7 tails out of the set of all possible 
arrangements. This is the relative frequency of the binomial 
distribution. Intuition thus suggests that the results of random 
coin tossing will be closely approximated by the binomial fre¬ 
quencies. Actual experiments lend support to this argument, 
so that it would seem possible to predict the results of a large 
number of tossings by the use of the probability calculus. This 
is merely an application of the law of large numbers. 
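
The hypothetical experiment can be imitated on a computer. The sketch below is only an illustration: it assumes that Python's pseudo-random generator is an adequate stand-in for unbiased tossing, and the function name, number of trials, and seed are arbitrary choices rather than anything specified in the text.

```python
import random
from math import comb


def toss_experiment(n_coins=10, n_trials=100_000, seed=1):
    """Relative frequencies of 0..n_coins heads in repeated random tosses."""
    rng = random.Random(seed)             # fixed seed so the run is repeatable
    counts = [0] * (n_coins + 1)
    for _ in range(n_trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_coins))
        counts[heads] += 1
    return [c / n_trials for c in counts]


observed = toss_experiment()
expected = [comb(10, k) / 2 ** 10 for k in range(11)]   # Table 19
largest_gap = max(abs(o - e) for o, e in zip(observed, expected))

for k, (o, e) in enumerate(zip(observed, expected)):
    print(f"{k:2d} heads  observed {o:.5f}  binomial {e:.5f}")
```

Increasing `n_trials` should, as a rule, bring the observed relative frequencies closer to the binomial probabilities, which is what the law of large numbers leads one to expect.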

The relationship between the results of coin tossing and the 
binomial probabilities suggests even more important inferences. 
For there may be conditions in real life that are similar to those 
involved in the tossing of coins, and statistical variables produced 
by these conditions may be expected to follow the symmetrical 
binomial distribution and in special instances the normal curve. 
To illustrate the conditions that might give rise to such results 
consider the following examples: 


Example 1. Suppose that the sex of the offspring of a certain animal is 
determined by the type of the egg cell in the female that unites with the 
sperm cell of the male, and suppose that the number of egg cells in 
the female that will produce male offspring is on the average equal to the 
number of egg cells that will produce female offspring. If sperm cells unite 
with egg cells in a random manner, the chance is 1/2 of an offspring being a 
male and 1/2 of its being a female. These are essentially the same conditions 
that determine whether a symmetrical coin should turn up heads or tails 
when tossed at random. Under such conditions the frequency distribution 
of the number of males in families of a given size should theoretically follow 
the symmetrical binomial distribution. Thus families of size 5 should be 
expected to vary in sex combination as follows: 


Number          Proportion of Families Having 
of Males        Specified Number of Males 
   0                   1/32 = 0.03 
   1                   5/32 = 0.16 
   2                  10/32 = 0.31 
   3                  10/32 = 0.31 
   4                   5/32 = 0.16 
   5                   1/32 = 0.03 


A study of the sex of pigs in 116 litters of 5 pigs each showed the following: 

Number          Proportion of Litters Having 
of Males        Specified Number of Males 
   0                   0.02 
   1                   0.17 
   2                   0.35 
   3                   0.30 
   4                   0.12 
   5                   0.03 


The closeness of these figures to those above suggests that the theory of 
sex determination outlined above might very well be valid for pigs. 
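
The comparison in Example 1 can be laid out side by side. In the sketch below (Python 3.8 or later, illustrative names) the observed proportions are those quoted in the text for the 116 litters; the expected number of litters is an added convenience, obtained by multiplying each binomial probability by 116.

```python
from math import comb

LITTER_SIZE = 5
N_LITTERS = 116

# Symmetrical binomial probabilities of 0..5 males in a litter of 5.
theoretical = [comb(LITTER_SIZE, k) / 2 ** LITTER_SIZE
               for k in range(LITTER_SIZE + 1)]

# Observed proportions, as quoted in the text.
observed = [0.02, 0.17, 0.35, 0.30, 0.12, 0.03]

expected_litters = [N_LITTERS * t for t in theoretical]
largest_gap = max(abs(t - o) for t, o in zip(theoretical, observed))

for k in range(LITTER_SIZE + 1):
    print(f"{k} males  theory {theoretical[k]:.2f}  observed {observed[k]:.2f}"
          f"  expected litters {expected_litters[k]:.1f}")
```

This only displays the agreement; it is not a formal test of whether the differences could be due to chance.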

Example 2. In Example 1, conditions were such as to produce a variable 
(number of males) that was discrete and integral. The present hypothetical 
example will suggest conditions which might produce a variable that was 
discrete but not integral and that was distributed in the form of a sym¬ 
metrical binomial. It also suggests conditions under which the variable 
might be practically continuous and distributed like a normal curve. 

Suppose that there are a large number of bags of flour, say 10,000, each 
weighing exactly 5 pounds. Suppose that an experimenter opens each bag 
in succession and adds or subtracts a certain quantity of flour to each bag 
in accordance with the following rule: Whenever he opens a bag, he also 
tosses 10 coins. For each head that appears he adds an ounce of flour to the 
bag; for each tail, he subtracts an ounce. The result of this procedure will 
be 10,000 bags of flour varying in weight from 5 pounds — 10 ounces to 
5 pounds + 10 ounces, the unit difference being 2 ounces. In accordance 
with the foregoing analysis, the distribution of the weights of these bags of 
flour would be approximately as follows: 


Weight of Bag       Relative Frequency of Occurrence 
4 lb.  6 oz.                  1/1,024 
4 lb.  8 oz.                 10/1,024 
4 lb. 10 oz.                 45/1,024 
4 lb. 12 oz.                120/1,024 
4 lb. 14 oz.                210/1,024 
5 lb.  0 oz.                252/1,024 
5 lb.  2 oz.                210/1,024 
5 lb.  4 oz.                120/1,024 
5 lb.  6 oz.                 45/1,024 
5 lb.  8 oz.                 10/1,024 
5 lb. 10 oz.                  1/1,024 


In other words, the distribution of weights would approximately conform 
to a symmetrical binomial distribution with a mean weight of 5 pounds and a 
standard deviation of 2 × √2.5 = 3.16 ounces, the standard deviation of the 
number of heads being √2.5 and each unit step being 2 ounces. 
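
The mean and standard deviation of the bag weights can be computed directly from the table of weights and relative frequencies. The sketch below assumes Python 3.8 or later and works in ounces (16 ounces to the pound); the names are illustrative.

```python
from math import comb, sqrt

N_COINS = 10
BASE_OZ = 5 * 16            # 5 pounds expressed in ounces

# Each head adds an ounce and each tail subtracts one.
weights = [BASE_OZ + heads - (N_COINS - heads) for heads in range(N_COINS + 1)]
probs = [comb(N_COINS, heads) / 2 ** N_COINS for heads in range(N_COINS + 1)]

mean_oz = sum(w * p for w, p in zip(weights, probs))
sd_oz = sqrt(sum((w - mean_oz) ** 2 * p for w, p in zip(weights, probs)))

for w, p in zip(weights, probs):
    print(f"{w // 16} lb. {w % 16:2d} oz.  {p:.5f}")
print("mean (oz):", mean_oz, " standard deviation (oz):", round(sd_oz, 2))
```

Since one more head raises the weight by 2 ounces, the standard deviation of the weights is twice the standard deviation of the number of heads.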

This shows how a variable may be produced that is discrete but not 
integral and that is distributed in the form of a symmetrical binomial 
distribution. To produce a variable that is practically continuous, it is 
necessary to increase the number of coins from 10 to 100, say, and to reduce 
the amount of flour added or subtracted to 0.01 ounce. Differences as 
small as 0.02 ounce would thus be possible, and for all practical purposes 
the weight of a bag of flour could be said to be continuous. Under these 
conditions a graph of the distributions of weights would be practically 
continuous and, as indicated in the theoretical discussion, would have the 
form of a normal frequency curve. 

Example 3. Example 2 was entirely 
hypothetical. An apparatus has been 
constructed, however (see Fig. 100), that 
reproduces in somewhat different form the 
conditions of Example 2. By its use the 
results predicted in Example 2 can be 
concretely illustrated. 

The apparatus of Fig. 100 was devised many years ago by Sir Francis 
Galton and subsequently elaborated by Karl Pearson.¹ As represented in 
Fig. 100 it consists of a series of rows of wedges, each row containing an 
additional wedge and so arranged that its wedges come halfway between the 
wedges of the row above. If the wedges are placed 1 centimeter apart, then 
a small ball dropped into the top of the machine will have an equal chance 
in each row of being deflected 0.5 centimeter either to the left or the right. 
The apparatus of Fig. 100 has 10 rows. The final deviation of the ball from 
the central point 0 will thus be the algebraic sum of the left (minus) and 
right (plus) deflections as it falls through the 10 rows. The possible range 
of this final deviation is from −5 to +5 centimeters. Since the probability 
of a plus and minus deviation of 0.5 centimeter is in each row equal to 1/2 
(similar to the probability of a head and a tail for a coin) and since there 
are 10 rows (as there were 10 coins in the previous case), the probabilities of 
final deviations of −5, −4, −3, −2, −1, 0, +1, +2, +3, +4, +5 
centimeters will be the same as those of the binomial distribution, 

$$P(N_1) = \frac{10!}{N_1!\,(10 - N_1)!}\left(\frac{1}{2}\right)^{10}$$

which are given in Table 19, page 282. 

Fig. 100.—The Pearson-Galton apparatus for physical derivation of a 
binomial distribution. 


1 Galton, Francis, Natural Inheritance (Macmillan & Company, Ltd., 
London, 1889), p. 63; Pearson, Karl, “Skew Variation in Homogeneous 
Material,” Philosophical Transactions of the Royal Society of London, Series 
A, Vol. 186 (1895), p. 343. Pearson’s contribution was to replace the set of 
nails used by Galton by a set of sliding wedges that could be so adjusted 
that the chances of deflection to the left and right were not equal. Figure 
100 follows the pattern of Galton’s apparatus. 

These are the theoretical probabilities of the apparatus. If a large 
number of balls are actually dropped into the machine, the exact result 
cannot be predicted. Intuition suggests, however, that the relative fre¬ 
quencies with which the balls will pile up in the different slots will tend to 
approximate the theoretical probabilities and this is demonstrated by actual 
experiments. Such a result is pictured in Fig. 100 by the shading of the 
slots in proportion to the binomial probabilities. 

It will be noted that in this case the variable, that is, the final deviation 
of a ball from the central point 0 is again discrete. Deviations of integral 
centimeters only are possible. If, however, the number of rows were 
increased from 10 to 1,000, say, and if at the same time the wedges were 
reduced to 0.01 centimeter in size and placed so that they were only 0.01 
centimeter apart (the balls would, of course, have to be correspondingly 
reduced in size), then the final deviations would vary by 0.01 centimeter 
and might be practically considered a continuous variable. The distribution 
of relative frequencies would in this case closely approximate a 
smooth frequency curve, which would once again be the normal curve. 

Theory of Errors. Errors in physical measurements may be broken up 
into several components. (1) The “instrumental error” may be attributed 
to the particular instrument with which the measurement is made; every 
measurement by it will contain a certain error that may be assigned to that 
instrument. (2) The “personal error” may be attributed to the particular 
person undertaking the measurement; every observation by him will be 
influenced by his “personal equation.” (3) Another component error may 
be attributed to particular external conditions, such as the temperature, 
sunlight, and wind. These errors due to the instrument, the observer, and 
specific external conditions are all “systematic errors” that can be allowed 
for. (4) A final component error is the “accidental error,” or “residual 
error,” to which no definite cause can be assigned. Such errors are the 
result of the whole host of chance forces, the same sort of forces that 
determine whether an unbiased coin comes up heads or tails. The total 
accidental error in any individual measurement may be taken to be the sum 
of a number of small accidental errors arising from different causes.¹ Slight 
irregular changes in external conditions, such as the vibration on account of 
air currents or irregular changes in the personal equation of the observer, 
are examples of causes for accidental error of measurement. If it is 
possible to discover the law of action of any error, it is thereby removed from 
the class of accidental errors to the class of systematic errors. 

If the number of forces affecting the residual errors in any series of 
measurements is large, if each causes a very small plus or minus deviation 
from the true value, and if the probability of a plus and a minus deviation 
is for each force equal to 1/2, then, as in the case of the flour-bag experiment 
and the Pearson-Galton apparatus, the final residual errors of the series of 
measurements will tend to be distributed in accordance with the normal 
curve. The mean of this curve will be the true value (after allowance has 

1 See Brunt, David, The Combinations of Observations, pp. 3-4. 

been made, of course, for the systematic errors mentioned above), and the 
standard deviation of the curve will be an index of the precision of 
measurement. This is the theory of errors.¹ It is supported by the close agreement 
between the normal curve and distributions of actual measurements. In 
fact, the normal curve is often spoken of as the “error curve” or the 
“Gaussian error curve,” after the man who was among the first to recognize the 
possibility of applying the theory of probability to the investigation of the 
errors of measurement.² 

Summary of Conditions Leading to the Symmetrical Binomial 
Distribution and the Normal Curve. The foregoing examples 
suggest that whenever the following conditions exist in real 
life, the data generated by these conditions will tend to be dis¬ 
tributed in the form of a symmetrical binomial distribution and, 
if certain other conditions are also present, in the form of a 
normal curve. The conditions giving rise to the symmetrical 
binomial distribution may be stated as follows: 

1. In the absence of certain “causes” of variation or in the 
event of a perfect balancing of their effects, the data assume a 
fixed central value. (The 5 pounds of the flour illustration, the 
“true value” in a series of measurements.) 

2. Deviations from this central value result from certain 
“causes” of variation, the effect of any “cause” being either to 
add a fixed quantity to the data or to subtract the same quantity. 
(To add or subtract 1 ounce of flour or to add or subtract an 
“error” of 0.5 centimeter.) 

3. A cause of variation tends to produce positive effects and 
negative effects in equal proportion, that is, P(+) = P(−) = 1/2. 
(The probability of a head equals the probability of a tail; the 
probability of a positive error equals the probability of a negative 
error.) 

4. The effects of all contributory causes of variation are of 
equal magnitude. (Each adds or subtracts 1 ounce of flour 
or 0.5 centimeter.) 

1 Actually this is a special case of a more general statement of the theory. 
As pointed out in Smith and Duncan, Sampling Statistics, p. 97, each force 
may cause deviations of varying size with varying probabilities and the 
final residual errors will still tend to be normally distributed provided that 
the number of forces is very large and the relative importance of each is 
about the same. 

2 For Gauss’s fundamental works see Abhandlungen zur Methode der 
kleinsten Quadrate (A. Börsch and P. Simon, Berlin, 1887). 

5. The contributory causes are independent in their action. 
In other words, the contribution of a positive or negative effect 
by any causal factor is independent of the previous contributions 
of other causal factors. 

6. The total deviation of any element from its central value 
is the algebraic sum of the positive and negative contributions of 
the individual causal factors. (The total amount of flour added 
or subtracted from a bag is the sum of the ounces added for each 
head tossed minus the ounces subtracted for each tail tossed.) 

If in addition to these conditions, the following also exist, then 
the resulting distribution will tend to conform to the normal 
curve: 

7. The number of contributory causes is very large. (A 
large number of coins are tossed; the binomial machine contains 
a large number of rows.) 

8. The positive and negative contributions of each cause are 
very small. (If 0.01 ounce is added or subtracted instead of 1 
ounce; if 0.005 centimeter, instead of 0.5 centimeter.) 

It is to be noted that, so far as the normal curve is concerned, 
not all these conditions are necessary for its generation. The 
above conditions will produce it, but the normal curve may also 
occur when some of these conditions are absent.¹ It may be 
stated here that the normal curve will still be produced if 
conditions 2 and 3 are relaxed so that a causal factor may affect the 
data in varying degree and with varying probabilities and also if 
condition 4 is only approximately and not exactly true.² The 
most important conditions are 6 to 8 and condition 4 in an 
approximate form. For example, in the case of the flour 
illustration, the resulting weights of the bags of flour would still tend 
to be normally distributed even if biased dice instead of unbiased 
coins were used and if the amount of flour added or subtracted 
varied with the result of the throw (say 0.001 ounce for the 
occurrence of a one, −0.002 ounce for the occurrence of a two, 
0.003 ounce for the occurrence of a three, −0.004 ounce for the 
occurrence of a four, etc.) provided that the number of dice thrown 
was very large and the amount added or subtracted per die was 
very small and of about the same order of magnitude from die to 
^ See Smith and Duncan, Sampling Statistics, pp. 97-100. 

2 XJnder certain conditions the requirement of independence (condition o) 
may also be relaxed. See Smith and Duncan, Sampling Statistics, pp. 63-65. 
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die. The normal curve is thus a more general phenomenon than 
the symmetrical binomial distribution.^ 
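The flour illustration with biased dice may be tried as a small numerical experiment. The following Python sketch is illustrative only: the particular bias of the dice, the number of dice per bag, the number of bags, and the function names are all assumptions that do not come from the text. It adds the small positive and negative contributions for each bag and reports the share of bags lying within one standard deviation of the mean, which for a normal curve is about 0.68.

```python
import random
import statistics


def simulate_bag_deviations(n_bags=2000, n_dice=400, seed=1):
    """Total weight deviation (ounces) of each bag after n_dice biased throws."""
    rng = random.Random(seed)
    faces = [1, 2, 3, 4, 5, 6]
    # Hypothetical bias (not from the text): higher faces are more likely.
    weights = [0.10, 0.12, 0.15, 0.18, 0.20, 0.25]
    # Pattern from the text: +0.001 for a one, -0.002 for a two,
    # +0.003 for a three, -0.004 for a four, and so on.
    contribution = {
        face: (0.001 * face if face % 2 == 1 else -0.001 * face)
        for face in faces
    }
    totals = []
    for _ in range(n_bags):
        throws = rng.choices(faces, weights=weights, k=n_dice)
        totals.append(sum(contribution[face] for face in throws))
    return totals


def share_within_one_sd(values):
    """Fraction of values lying within one standard deviation of their mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= sd)
    return inside / len(values)


if __name__ == "__main__":
    totals = simulate_bag_deviations()
    print("share within one standard deviation:",
          round(share_within_one_sd(totals), 3))
```

A share near 0.68 is consistent with, though of course not proof of, an approximately normal distribution of the bag weights.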

Examples of Normal Frequency Distributions. Natural forces 
appear to generate normal frequency distributions in many fields. 
Physical measurements have already been mentioned. Figure 
101 shows the distribution of heights of 300 eighteen-year-old 
Princeton freshmen. The grades of students on examinations, 
hourly earnings of workers, the length of life of electric- 
light bulbs, the distance of baseball throws of first-year high- 
school girls are all normally distributed variables. In these 
fields and in many others, it would seem that the conditions of 
variation are those which theoretically give rise to the normal 
curve. 



Fig. 101.—Normal curve fitted to heights of 300 Princeton freshmen. 

DETERMINATION OF NORMALITY 

Several procedures are available for determining whether the 
population from which a given set of sample data has been taken 
might reasonably be considered to conform to the normal curve. 
In general, these consist of comparing the histogram constructed 
from the sample data with a normal curve “fitted” to this 
histogram. The difference in the various procedures lies in the bases 
of comparison. Several of the more important procedures will 
now be discussed. 

³ Mathematically the normal curve can be derived from a great variety of 
different assumptions. See, for example, Czuber, Emanuel, Theorie der 
Beobachtungsfehler (B. G. Teubner, Leipzig, 1891). 

Graphic Comparison. The simplest method of determining 
whether the assumption of normality is or is not reasonable is 
to graph the histogram and normal curve together and see how 
well the curve fits. The test here is purely a subjective one, 
but in many cases when the fit is exceptionally good or excep¬ 
tionally bad this is probably sufficient for acceptance or rejection 
of the hypothesis. 

In making a graphic comparison of a sample histogram and a 
normal curve, it is necessary to determine what mean and what 
standard deviation should be assigned to the curve. Offhand 
the simplest procedure would appear to be the assignment to 
the curve of the mean and standard deviation of the histogram, 
for presumably these are the best estimates that may be made 
of the mean and standard deviation of the population from which 
the sample was taken.* It will be recalled, however, that in the 
calculation of the mean and standard deviation of the histogram 
the data were distributed among various classes or groups and all 
the cases in any class interval were assumed to be concentrated 
at the mid-point of the interval. But the population is pre¬ 
sumably distributed in the form of a smooth curve, so that, in 
estimating its mean and standard deviation from that of the 
histogram, allowance must be made for the grouping of the data 
in the construction of the histogram. In any interval a smooth 
bell-shaped curve, such as the normal curve, will have more cases 
that are on the side toward the mean than on the side away from 
the mean. The assumption that all cases are concentrated at 
the mid-point of an interval will not cause any appreciable error 
in the mean calculated from grouped data, for plus and minus 
deviations will offset each other; but it will cause the standard 
deviation of the grouped data to be greater than the standard 
deviation of the smooth curve that represents the true distribu¬ 
tion of the data. Some adjustment should therefore be made in 
the standard deviation of the histogram before it is taken as an 
estimate of the standard deviation of the population. 

* Cf. pp. 318 and 319. 

The adjustment that must be made for grouping has been 
determined by W. F. Sheppard. He has shown that under 
conditions that are true for a normal distribution the variance 
of the smooth curve is approximately equal to the variance of 
the grouped data minus one-twelfth the square of the class 
interval.¹ In other words, if $\mu_2$ (uncorrected) is the second moment 
($= \sigma^2$) of the grouped data about its mean and $\mu_2$ is the second 
moment of the smooth curve about its mean, then, with $i$ denoting 
the class interval, 

$$\mu_2 = \mu_2\,(\text{uncorrected}) - \tfrac{1}{12}\, i^2$$

The quantity $\tfrac{1}{12} i^2$ is Sheppard's correction for grouping that is 
required for estimating the standard deviation of the fitted 
normal curve. 
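Sheppard's correction is easily applied in a few lines. The Python sketch below is offered only as an illustration: the function name is an assumption, and the grouped variance used in the example (6.18, with a class interval of 1) is a hypothetical figure, not one taken from the text.

```python
import math


def sheppard_corrected_sd(grouped_variance, class_interval):
    """Standard deviation after subtracting one-twelfth of the squared class interval."""
    corrected_variance = grouped_variance - class_interval ** 2 / 12.0
    if corrected_variance <= 0:
        raise ValueError("class interval too wide for the correction to apply")
    return math.sqrt(corrected_variance)


if __name__ == "__main__":
    # Hypothetical grouped variance of 6.18 with a class interval of 1 unit.
    print(round(sheppard_corrected_sd(6.18, 1.0), 4))
```

The correction always reduces the standard deviation, and the reduction is slight whenever the class interval is small relative to the spread of the data.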

In fitting a normal curve to a sample histogram, therefore, the 
mean of the curve is taken equal to the mean of the histogram 
and the variance of the curve is taken equal to the variance of the 
histogram minus $\tfrac{1}{12} i^2$. In plotting the curve a table of the 
ordinates of the standard normal curve may conveniently be used. 
If the histogram to which the curve is to be fitted is of the usual 
type, that is, if it consists of a series of rectangles of which the 
heights measure aggregate frequencies and if the intervals on 
which these rectangles are erected are laid off in terms of original 
X units, then ordinates of the standard normal curve can be 
taken to represent the particular normal curve desired by making 
certain simple adjustments. The ordinates of the standard 
normal curve, it will be recalled, are given for values of $x/\sigma$ that 
are measured from the mean of the distribution and are expressed 
in terms of standard deviation units. It will also be recalled that 
the area of the curve over any given interval measures the relative 
frequency of cases falling in this interval. To make these 
ordinates represent a normal curve with a given mean and a 
given standard deviation, they need only be plotted so that the 
ordinate for $x/\sigma = 0$ comes at the specified mean value and 
ordinates for other values of $x/\sigma$ come at $X = \bar{X} \pm (x/\sigma)\,\sigma$. To 
put them on the same basis as the histogram, however, they must 
also all be multiplied by $Ni/\sigma$. This is because the total area 
of the histogram² is $Ni$ and that of the standard normal curve is 
1 (that is, 100 per cent), whereas the abscissa scale on which the 
histogram is plotted is $\sigma$ times the abscissa scale of the standard 
curve. This use of the ordinates of the standard normal curve 
may be illustrated by fitting a normal curve to the heights of 
300 Princeton freshmen. 

¹ Cf. Proceedings of the London Mathematical Society, Vol. 29, pp. 353-380. 

² The area of any one rectangle is $Fi$, and the total area is therefore 
$\sum Fi = Ni$. 

In Table 20 the mid-points of the various class intervals into 
which the 300 heights were distributed are set down in column 
(1). In column (2) the difference between these mid-points and 
the mean of the distribution ($\bar{X} = 70.47$) is computed, and in 
column (3) this is divided by the adjusted standard deviation. 
The results are the various values of $x/\sigma$ that correspond to the 
mid-points of the various class intervals. The ordinates of the 
standard normal curve at these values of $x/\sigma$ are then computed 
from Table VII (see Appendix, page 694) and entered in column 
(4). Finally, in column (5), these standard ordinates are 
multiplied by 

$$\frac{Ni}{\sigma} = \frac{(300)(1)}{2.47} = 121.46$$

to put them on a par with the sample histogram. 
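The five columns of Table 20 can be recomputed directly rather than read from a table of ordinates. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and the standard normal ordinate is computed from its formula instead of being looked up in Table VII. Because the printed table works with $x/\sigma$ rounded to two decimals, the computed figures may differ from the printed ones in the last digit.

```python
import math


def standard_normal_ordinate(z):
    """Height of the standard normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def fitted_ordinates(midpoints, mean, sd, n_cases, class_interval):
    """Return (mid-point, x/sigma, scaled ordinate) for each class mid-point."""
    scale = n_cases * class_interval / sd  # the factor N i / sigma
    rows = []
    for x_mid in midpoints:
        z = (x_mid - mean) / sd
        rows.append((x_mid, z, scale * standard_normal_ordinate(z)))
    return rows


if __name__ == "__main__":
    # Mid-points 62.5, 63.5, ..., 77.5 as in column (1) of Table 20.
    midpoints = [62.5 + k for k in range(16)]
    for x_mid, z, height in fitted_ordinates(midpoints, 70.47, 2.47, 300, 1.0):
        print(f"{x_mid:5.1f} {z:6.2f} {height:6.2f}")
```

Plotting the third value of each row against the first, on the same axes as the histogram, gives the fitted curve used in the graphic comparison.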

Table 20.—Calculation of the Ordinates of the Normal Curve 
That Fits the Distribution of Heights of 300 Princeton Freshmen 

  (1)          (2)               (3)            (4)                   (5)
  $X$    $X - \bar{X} = x$    $x/\sigma$    Ordinate of      Col. (4) $\times\ Ni/\sigma$
                                           standard curve

  62.5        -7.97            -3.22          0.00224                0.27
  63.5        -6.97            -2.82          0.00748                0.91
  64.5        -5.97            -2.42          0.02134                2.59
  65.5        -4.97            -2.01          0.05292                6.43
  66.5        -3.97            -1.59          0.11270               13.69
  67.5        -2.97            -1.19          0.19652               23.87
  68.5        -1.97            -0.80          0.28969               35.19
  69.5        -0.97            -0.39          0.36973               44.91
  70.5         0.03             0.01          0.39892               48.45
  71.5         1.03             0.42          0.36526               44.36
  72.5         2.03             0.82          0.28504               34.62
  73.5         3.03             1.22          0.18954               23.02
  74.5         4.03             1.63          0.10567               12.83
  75.5         5.03             2.04          0.04980                6.05
  76.5         6.03             2.44          0.02033                2.47
  77.5         7.03             2.84          0.00707                0.86

  $\bar{X} = 70.47$          $\sigma$ (corrected) $= 2.47$

Test of Goodness of Fit. Another method of comparing a 
sample histogram with a normal curve is to compare the 
frequencies given by the two, interval by interval. Whereas the 
previous method was primarily subjective in that a conclusion 
had to be reached from a mere inspection of the two graphs, 
comparison of the histogram and the curve, interval by interval, 
yields a numerical criterion of “goodness of fit.” A procedure 
that has found favor because it permits a comparison with chance 
results is to take the difference between the absolute frequencies¹ 
given by the curve and by the histogram for each interval, square 
these differences, divide each by the frequency of the curve for 
that interval, and finally sum the results. The quantity so 
obtained is 

$$\sum \frac{(F - f)^2}{f},$$

where $F$ represents for each class interval the frequency given by 
the histogram and $f$ the frequency given by the curve. 

Sampling theory shows² that, if this quantity is calculated for 
a large number (theoretically, an infinite number) of sample 
histograms from the same normal population, then the distribution 
of these various sample values of $\sum (F - f)^2 / f$ is adequately 
represented by a probability curve known as the “$\chi^2$ curve” and 
this can be used to determine the probability of a larger value of 
$\sum (F - f)^2 / f$ by chance. If the probability is a large one, then 
the difference between the given sample histogram and the 
normal curve, as measured by $\sum (F - f)^2 / f$, may reasonably be 
attributed to chance; the curve may be deemed a good fit, and 
the population from which the sample was drawn may 
tentatively be taken as normal. If the probability is very small, 
however, say less than 0.05, then the difference between the 
histogram and the curve is to be attributed to something else 
than chance, presumably to the nonnormality of the population 
from which the sample was drawn. In this case, the normal 
curve is not deemed a good fit, and the hypothesis of a normal 
population is rejected. Owing to its use of the $\chi^2$ curve, this 
second method of comparison is called the “$\chi^2$ test of goodness 
of fit.” 

¹ For the curve, this means the relative frequencies times $N$, the total 
number of cases in the sample. 

² On the $\chi^2$ distribution see Smith and Duncan, Sampling Statistics, pp. 
111-112 and Chap. XIII. 
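The quantity $\sum (F - f)^2 / f$ is simple to compute once the two sets of frequencies are at hand. The Python sketch below is illustrative only; the function name is an assumption, and the frequencies in the example are made-up figures chosen for easy arithmetic, not the Princeton data.

```python
def goodness_of_fit_statistic(histogram_freqs, curve_freqs):
    """Sum of (F - f)^2 / f over all class intervals."""
    if len(histogram_freqs) != len(curve_freqs):
        raise ValueError("both lists must cover the same class intervals")
    return sum((F - f) ** 2 / f for F, f in zip(histogram_freqs, curve_freqs))


if __name__ == "__main__":
    # Hypothetical frequencies: F from a histogram, f from a fitted curve.
    F = [10, 20, 30]
    f = [12.0, 18.0, 30.0]
    print(round(goodness_of_fit_statistic(F, f), 4))
```

Each interval contributes in proportion to the squared discrepancy relative to the curve frequency, so a given absolute difference counts for more where the curve frequency is small.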

The $\chi^2$ test may be illustrated, as in the previous case, by the 
distribution of heights of 300 Princeton freshmen. The 
numerical calculations involved are set forth in Table 21. In column (1) 
are put the upper limits (not the mid-points in this case) of the 
various class intervals of the histogram of heights, the first and 
last intervals being considered to run to $-\infty$ and $+\infty$, 
respectively. The deviations of these class-interval limits from the 
mean of the distribution are calculated in column (2), and their 
ratio to the “corrected” standard deviation is computed in 
column (3). In column (4) are written the areas under the 
standard normal curve from its lower limit ($-\infty$) to each of 
these class-interval limits, now measured in standard-deviation 
units. These areas are obtained from Table VI, page 693, 
of the Appendix. The figure in column (4) already gives the 
area for the first interval ($-\infty$ to 63), and the areas for the other 
intervals can be computed by taking the difference between 
the area under the curve up to one class limit and the area up 
to the next higher class limit. These differences are written in 
column (5). In order to avoid very small areas in the extreme 
intervals (and hence a distortion of the test), several groups at 
each end are amalgamated so that the areas for the first and last 
interval are at least equal to $5/N$.¹ 

The new arrangement is indicated in columns (1') and (5'). 
Now it will be noted that the figures of column (5') represent 
the relative frequencies given by the curve. To convert them 
to aggregate frequencies that are comparable with the 
aggregate frequencies of the histogram, it is necessary to multiply 
them by $N$ (= 300), the total number of cases. This is done 
in column (6). In column (7) are written the histogram 
frequencies for the intervals of column (1'), and in the remaining 
columns the differences between the histogram and curve 
frequencies are computed, squared, and divided by the curve 
frequencies. The sum of the last column is the value of $\sum (F - f)^2 / f$ 
desired. In the present instance this is found to be 3.867. 
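The steps that lead from class limits to curve frequencies may be sketched as follows. This Python sketch is an illustration under stated assumptions: the function names are hypothetical, the areas are computed with the error function instead of being read from Table VI, and the amalgamation rule is coded simply as "merge end classes until each end frequency is at least 5." The grouping it produces for any particular data need not coincide with the one adopted in Table 21.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve from minus infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def curve_frequencies(upper_limits, mean, sd, n_cases):
    """Curve frequency for each class; the last class runs to plus infinity."""
    freqs = []
    previous_area = 0.0
    for limit in upper_limits:
        area = normal_cdf((limit - mean) / sd)
        freqs.append(n_cases * (area - previous_area))
        previous_area = area
    freqs.append(n_cases * (1.0 - previous_area))
    return freqs


def merge_small_tails(freqs, minimum=5.0):
    """Amalgamate end classes until the first and last frequencies reach minimum."""
    merged = list(freqs)
    while len(merged) > 2 and merged[0] < minimum:
        merged[1] += merged[0]
        del merged[0]
    while len(merged) > 2 and merged[-1] < minimum:
        merged[-2] += merged[-1]
        del merged[-1]
    return merged


if __name__ == "__main__":
    limits = [63 + k for k in range(15)]  # upper class limits 63, 64, ..., 77
    freqs = curve_frequencies(limits, 70.47, 2.47, 300)
    for value in merge_small_tails(freqs):
        print(round(value, 2))
```

Amalgamation leaves the total of the curve frequencies unchanged; it only redistributes the small end frequencies into broader end classes.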


¹ See ibid., p. 140. 

* Table 21, column (4): the items in this column are obtained by subtracting 
from 0.50000 the figures found for each $-(X - \bar{X})/\sigma$ and by adding to 
0.50000 the figures found for each $+(X - \bar{X})/\sigma$ in Table VI of the 
Appendix, p. 693. 

To determine whether the value of $\sum (F - f)^2 / f = 3.867$ 
represents a good fit or not, turn to Table 22. Here are given 
critical values of $\sum (F - f)^2 / f$. The $n$ of 
the first column is equal to the number of class intervals minus 3.² 
The figures in the second column represent values of $\sum (F - f)^2 / f$ 
for which there is a probability of 0.05 that an equal or greater 
value would be obtained by mere chance. For example, in the 
present instance, $n = 11 - 3 = 8$, and Table 22 shows that, if 
the data were truly normal, sample values for $\sum (F - f)^2 / f$ that 
were equal to or greater than 15.51 would be obtained only 
5 times out of 100 for such a value of $n$. Since the computed 
value of $\sum (F - f)^2 / f = 3.867$, the chance of an equal or greater 
value is much more than 0.05. Hence the curve is deemed a 
good fit, and the distribution of heights may be said to be normal. 

² See Smith and Duncan, Sampling Statistics, pp. 327-328, for an 
explanation of the significance of $n$ in this case. 

Table 22.—Critical Values for $\sum (F - f)^2 / f$* 

          Values of $\sum (F - f)^2 / f$ for 
          Which the Probability of an 
          Equal or Greater Value Is 
   n      Just 0.05 

   1        3.84 
   2        5.99 
   3        7.81 
   4        9.49 
   5       11.07 
   6       12.59 
   7       14.07 
   8       15.51 
   9       16.92 
  10       18.31 
  11       19.67 
  12       21.03 
  13       22.36 
  14       23.68 
  15       25.00 
  16       26.30 
  17       27.59 
  18       28.87 
  19       30.14 
  20       31.41 

* Abridged from Table III, Table of $\chi^2$, of R. A. Fisher, Statistical 
Methods for Research Workers, Oliver & Boyd, Ltd., Edinburgh, by the 
kind permission of the publishers and author. 
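The use of Table 22 amounts to a simple lookup and comparison. In the Python sketch below the critical values are copied from Table 22; the function name and the default of three subtracted from the number of class intervals (as stated in the text) are written out as parameters, and the names themselves are assumptions.

```python
# Critical values at the 0.05 level, copied from Table 22 (n = 1, ..., 20).
CRITICAL_VALUES_05 = {
    1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07,
    6: 12.59, 7: 14.07, 8: 15.51, 9: 16.92, 10: 18.31,
    11: 19.67, 12: 21.03, 13: 22.36, 14: 23.68, 15: 25.00,
    16: 26.30, 17: 27.59, 18: 28.87, 19: 30.14, 20: 31.41,
}


def is_good_fit(statistic, n_class_intervals, subtract=3):
    """True when the statistic falls below the 0.05 critical value for n."""
    n = n_class_intervals - subtract
    if n not in CRITICAL_VALUES_05:
        raise ValueError("n is outside the range covered by Table 22")
    return statistic < CRITICAL_VALUES_05[n]


if __name__ == "__main__":
    # The example of the text: 11 class intervals, computed value 3.867.
    print(is_good_fit(3.867, 11))
```

For the heights, 3.867 lies far below 15.51, so the hypothesis of normality is not rejected.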

Comparison of Special Statistics. Although the $\chi^2$ test just 
outlined is very commonly used, it has certain weaknesses as a test 
of normality. (1) It should be noted that the squaring of the 
differences between the group frequencies removes any 
significance that might be attributed to the signs of the differences. 
For example, it might happen in a given case that all the 
histogram frequencies to the left of the center were larger than the 
normal curve frequencies and that all the histogram frequencies 
to the right of the center were less than the normal curve 
frequencies, indicating a well-marked positive skewness; 
nevertheless, if the absolute values of these differences were all small, the 
$\chi^2$ test might not indicate any departure from normality. (2) 
The necessity of combining the extreme intervals into larger 
groups causes a loss of information and reduces the number of 
points of comparison.¹ For these reasons, other methods of 
testing for normality have been proposed. 

If a set of sample data actually has come from a normal popula¬ 
tion, it is to be expected that its skewness will be slight and its 
kurtosis close to the normal kurtosis of 3. It would also be 
expected that the ratio of its average deviation to its standard 
deviation would be somewhere in the neighborhood of the value of 
this ratio for the normal curve (i.e., 0.7979). The departure of 
the actual values of these sample statistics from the theoretical 
values for the normal curve can thus be used as a test for normality. 
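The three statistics just named can be computed from ungrouped data in a few lines. The Python sketch below is an illustration under stated assumptions: it takes Pearson's usual definitions, $\beta_1 = \mu_3^2 / \mu_2^3$ and $\beta_2 = \mu_4 / \mu_2^2$, with moments about the mean divided by the number of cases; where the book's own definitions differ, they should take precedence. The function name and the sample figures are hypothetical.

```python
import math

# Ratio of average deviation to standard deviation for a normal curve.
NORMAL_RATIO = math.sqrt(2.0 / math.pi)  # about 0.7979


def normality_statistics(values):
    """Return (beta_1, beta_2, a) computed from moments about the mean."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    beta_1 = m3 ** 2 / m2 ** 3           # assumed definition of skewness
    beta_2 = m4 / m2 ** 2                # assumed definition of kurtosis
    average_deviation = sum(abs(d) for d in deviations) / n
    a = average_deviation / math.sqrt(m2)
    return beta_1, beta_2, a


if __name__ == "__main__":
    # A small made-up symmetrical set of figures, not data from the text.
    print(normality_statistics([1, 2, 3, 4, 5]))
    print(round(NORMAL_RATIO, 4))
```

For a sample from a normal population the three values should lie near 0, 3, and 0.7979, respectively; how near is a question for the tables discussed in the text.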

For the 300 Princeton freshmen, $\beta_1$, $\beta_2$, and the ratio of average 
deviation to standard deviation (indicated by the symbol $a$) had 
the values² $\beta_1 = 0.023$, $\beta_2 = 3.021$, and $a = 0.805$. These are 
all very close to the values 0, 3, and 0.7979 of a truly normal 
distribution. Hence this last, as well as the other tests, suggests that 
heights are normally distributed. 

¹ Its practical effect is to reduce the value of $n$ to be used in the table. 

² No account was taken of Sheppard's correction in computing these 
values. The average deviation used in making this test was computed from 
the mean by the formula 

$$\text{A.D.} = \frac{i}{N}\left[\sum F\,|d| + c\,(N_l - N_u) + \left(\tfrac{1}{4} + c^2\right) N_0\right]$$

where $d$ is the deviation of a class mid-point from the arbitrary origin $A$ in 
class-interval units, $c = (\bar{X} - A)/i$, $N_l$ = number of cases in intervals 
below the arbitrary origin, $N_u$ = number of cases in intervals above the 
arbitrary origin, and $N_0$ = number of cases in interval containing arbitrary 
origin. ($A$ must be in the same interval as $\bar{X}$.) Cf. Geary, R. C., and 
E. S. Pearson, Tests of Normality, p. 4. 

Sometimes the sample values of $\beta_1$, $\beta_2$, and $a$ are not so close to 
the normal values as in the foregoing illustration. In such 
instances, use may be made of tables published in Tests of 
Normality,¹ by R. C. Geary and E. S. Pearson. These tables give, for 
various-sized samples, the sample values of $\beta_1$, $\beta_2$, and $a$ for 
which the probability of a greater value is 0.05 and 0.01, 
respectively. For $\beta_2$ and $a$ they also give values of these statistics for 
which the probability of a smaller value is 0.05 and 0.01, 
respectively. If, in any given instance, the sample value of $\beta_1$, $\beta_2$, or $a$ 
falls outside the limits given for a probability of 0.05, say, then it 
may be concluded that the population from which the sample was 
drawn was not strictly normal. For the weights of the 300 
Princeton freshmen, for example, the sample values of $\beta_1$ and $\beta_2$ 
were 0.378 and 4.606. Both these are beyond the 0.01 probability 
point given by Geary and Pearson's tables for a sample of 300 
(these were 0.329 and 3.79, respectively), and it may therefore be 
concluded that the distribution of weights is definitely not normal. 

¹ Issued by the Biometrika Office, University College, London, and 
printed at the University Press, Cambridge, England. 



CHAPTER XII 


USE OF THE NORMAL FREQUENCY CURVE IN SAMPLING ANALYSIS 

The normal frequency curve has its greatest usefulness in the 
theory of random sampling.¹ While the full exposition of the 
theory of random sampling is beyond the scope of this book, some 
of the simpler aspects that relate to the use of the normal curve 
in sampling analysis are presented in the ensuing pages of this 
chapter. 

SAMPLING FROM A TWOFOLD POPULATION 

The Problem. An elementary problem in the theory of sam¬ 
pling is concerned with sampling from a twofold population. 
Consider the following problem: Suppose a large city is undergoing 
a fiercely contested election. The Radicals on the one hand and 
the Conservatives on the other are contending hotly for the 
mayoralty, and everyone in the city takes a stand on one side or 
the other. The voting population of the city thus forms a group 
in which a certain percentage are Radicals and a complementary 
percentage are Conservatives. Prior to the election these per¬ 
centages will not be known. They may, however, be estimated 
by taking a random sample. The inferences that may be made 
from such a random sample constitute the statistical problem 
that will now be analyzed. 

Sampling Distribution. For the sake of argument suppose that 
some omniscient being knew how each individual in the city stood 
politically. Suppose that he noted their positions on slips of 
paper, one for each individual, and put the slips into a large 
urn. Suppose, further, that there are actually an equal number 
of Radicals and Conservatives. Let the omniscient being mix the 
slips of paper thoroughly and then draw out a sample of 100 slips.² 
Let him note the division of opinion for this sample, put the slips 
back, and thoroughly mix them again. Finally, let him repeat 
this process many times, taking a sample of 100 each time, so that 
he eventually accumulates a large number of sample percentage 
divisions of opinion. Many, but by no means all, of these sample 
percentages will be the actual population percentage of 50; the 
others will be distributed above and below the 50 per cent level. 
This will be the sampling distribution of the sample percentages. 

¹ For more elaborate exposition than is contained in this chapter, see 
Smith and Duncan, Sampling Statistics, Parts II and III. 

² Mundane methods of obtaining random samples are discussed in ibid. 
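The labor of the omniscient being may be imitated on a computer. The Python sketch below is illustrative only: it treats each slip drawn from a very large, evenly divided urn as an independent draw with probability 0.5 of being Radical, and the number of repeated samples, the seed, and the function name are assumptions not found in the text.

```python
import random
import statistics


def sample_percentages(p_radical=0.5, sample_size=100, n_samples=5000, seed=7):
    """Percentage of Radicals in each of n_samples random samples."""
    rng = random.Random(seed)
    percentages = []
    for _ in range(n_samples):
        radicals = sum(1 for _slip in range(sample_size)
                       if rng.random() < p_radical)
        percentages.append(100.0 * radicals / sample_size)
    return percentages


if __name__ == "__main__":
    results = sample_percentages()
    print("mean of sample percentages:", round(statistics.fmean(results), 2))
    print("standard deviation:", round(statistics.pstdev(results), 2))
    exactly_fifty = sum(1 for r in results if r == 50.0) / len(results)
    print("share of samples at exactly 50 per cent:", round(exactly_fifty, 3))
```

The mean of the simulated percentages should lie close to 50, and only a minority of the samples should show exactly 50 per cent, the rest being scattered above and below that level.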

It is one of the important conclusions of the probability theory, 
based upon the analysis of the preceding chapters,* that the 
outcome of this process of random sampling will be a set of 
samples in which the relative frequency of samples in which the 
division of opinion is 0 per cent Radical, 10 per cent Radical, 20 
per cent Radical, 30 per cent Radical, . . . , 100 per cent Radical 
will be approximately the same as the probabilities of a binomial 
distribution in which $\mathbf{p}_1 = 0.50$ and $\mathbf{p}_2 = 0.50$ and $N = 100$.† In 
other words, relative frequencies of the sample percentages may 
be estimated a priori by means of the probability calculus. 
Furthermore, since the size of the sample is large ($N = 100$), the 
calculation of the probabilities can be simplified by using the 
normal curve as an approximation to the binomial distribution.‡ 
In this problem, the curve will have a mean of 50 per cent, 
because the population is equally divided between Radicals and 
Conservatives by hypothesis, and a standard deviation equal to 
5 per cent.§ The normal curve, with a mean of 50 per cent 
and a standard deviation of 5 per cent, thus gives approximately 
the “sampling distribution” for sample percentages taken from 
a population in which the division of opinion is exactly 50 per 
cent; and this is the sampling distribution of sample percentages 
conceived in the preceding paragraph. 

* See also ibid. 

† When the symbol for a sample statistic is in boldface type, it refers to 
the corresponding population parameter; thus here $\mathbf{p}_1$ and $\mathbf{p}_2$ refer to the 
population values for which $p_1$ and $p_2$ are corresponding sample statistics. 

‡ See pp. 283-290. 

§ When the variable is expressed as a percentage instead of as an absolute 
deviation from an integral mean value, the formula for the standard 
deviation is $\sigma_{\text{per cent}} = \sqrt{(0.5)(0.5)/N}$. Cf. p. 283. 

The foregoing result is not limited, however, to cases in which 
the actual division of opinion in the entire population is exactly 
fifty-fifty but may be shown to be valid for any division of opinion 
in the population.¹ Thus if the percentage of Radicals in the 
population is $\mathbf{p}_1$ and the percentage of Conservatives is $\mathbf{p}_2$ (where 
$\mathbf{p}_1 + \mathbf{p}_2 = 1$) and samples of size $N$ are drawn at random from this 
population, with replacements as above, then the relative 
frequencies of various sample percentages of Radical opinion will be 
given approximately by the probabilities of a normal frequency 
curve whose mean is $\mathbf{p}_1$ and whose standard deviation is 
$\sqrt{\mathbf{p}_1 \mathbf{p}_2 / N}$. 
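How closely the normal curve reproduces the binomial probabilities can be checked numerically. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and the normal approximation includes a half-unit continuity correction, a refinement that is not mentioned in this passage. It compares the exact binomial probability that a sample of 100 shows between 40 and 60 Radicals with the normal approximation.

```python
import math


def sd_of_sample_proportion(p1, n):
    """Standard deviation of the sample proportion, sqrt(p1 * p2 / N)."""
    return math.sqrt(p1 * (1.0 - p1) / n)


def binomial_probability_between(n, p1, low_count, high_count):
    """Exact probability that the count lies in [low_count, high_count]."""
    return sum(math.comb(n, k) * p1 ** k * (1.0 - p1) ** (n - k)
               for k in range(low_count, high_count + 1))


def normal_cdf(z):
    """Area under the standard normal curve from minus infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_probability_between(n, p1, low_count, high_count):
    """Normal approximation with a half-unit continuity correction (assumed)."""
    mean = n * p1
    sd = math.sqrt(n * p1 * (1.0 - p1))
    upper = normal_cdf((high_count + 0.5 - mean) / sd)
    lower = normal_cdf((low_count - 0.5 - mean) / sd)
    return upper - lower


if __name__ == "__main__":
    print("standard deviation of proportion:", sd_of_sample_proportion(0.5, 100))
    print("exact binomial:", round(binomial_probability_between(100, 0.5, 40, 60), 4))
    print("normal approximation:", round(normal_probability_between(100, 0.5, 40, 60), 4))
```

With samples as large as 100 the two probabilities agree closely, which is the justification for substituting the normal curve for the binomial distribution.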

This conclusion is of capital importance in making inferences 
about a population from which a single random sample has been 
drawn, as will now be demonstrated. 

Statistical Inferences from Samples. Types of Inference. In a 
real instance, no omniscient being is available to record every¬ 
one’s opinion. Prior to the actual election, the only practical 
way of determining the division of opinion is to take a random 
sample from the population. This may be done by stopping peo¬ 
ple on the street, ringing doorbells, sending out letters, or the 
like. When the results of the sample poll are counted, they may 
be used to draw inferences about the true division of opinion in the 
population in three ways—that is to say, three types of inference 
may be drawn. (1) A certain hypothesis regarding the true 
division of opinion may be tested as to its reasonableness in the 
light of the sample results and either rejected or accepted. 
(2) So-called “confidence limits” may be set up for which it may 
be said that there is a given probability that these limits include 
the true value. (3) A best single estimate may be made of the 
population percentage; this is called an “optimum estimate.” 
Each of these three types of inference will now be studied. 

Testing a Hypothesis as to the Population Percentage. Let the 
hypothesis be set up that the population is evenly divided 
between Radicals and Conservatives. Suppose the sample poll of 
100 voters shows 57 Radicals and 43 Conservatives. Although 
the sample shows a percentage in favor of the Radicals, it is 
possible, of course, that it may be misleading. Almost any result 
might be yielded by a single sample, whatever the population. 
If the population consisted even of 999,900 Conservatives and 100 
Radicals, it would still be possible for a random sample of 100 to 
consist of all Radicals. Such a result would not be very 
probable, however, and the reasonableness of any hypothesis must be 
judged by the probability of the sample result on the assumption 
that the hypothesis is valid. 

¹ For proof of this, see Smith and Duncan, Sampling Statistics, pp. 
186-190. 

The general procedure for testing the hypothesis is as follows: 
First, the risk that is to be allowed in rejecting a given hypothesis 
when it is in fact true must be decided upon. The “coefficient of 
risk,” as it is called, is commonly, but not necessarily, set at 0.05. 
In other words, it is the common practice to run the risk of 
rejecting a hypothesis 5 times out of 100 when it is in fact true. When a 
sampling distribution is normal, this is often done by saying that a 
given hypothesis will be rejected if the sample result falls beyond 
$\pm 2\sigma$ from the mean value given by the hypothesis.¹ In the 
present instance, the hypothesis that the true division of opinion 
is fifty-fifty suggests that random samples of 100 taken from such 
a population will have a mean percentage of Radical votes equal to 
$\mathbf{p}_1 = 50$ per cent and a standard deviation of sample percentages 
equal to $\sqrt{\mathbf{p}_1 \mathbf{p}_2 / N} = \sqrt{(0.5)(0.5)/100} = 5$ per cent. 

Fig. 102.—Sampling distribution of sample percentages of 100 votes. 

Accordingly, 95 per cent of the sample percentages would fall 
between 50 per cent ± 2 × 5 per cent, or between 40 and 60 per 
cent; 5 per cent of sample percentages would fall below 40 or 
above 60 per cent. Hence, if this hypothesis is rejected when a 
single sample return yields a percentage of Radical vote below 
40 or above 60 per cent, then the hypothesis would in many 
sample polls be rejected only 5 per cent of the time when it was 
actually true. In other words, the rejector would be wrong only 
1 out of 20 times in a large number of tries. 

¹ The desirability in some cases of using regions of rejection that fall all 
above or all below the mean is discussed in ibid., pp. 196-201. 

For the given problem, suppose the coefficient of risk is put at 
0.05. Since the sample return is 57 Radical votes out of a total 
of 100, the hypothesis of an equal division of opinion is not to be 
rejected, for the sample result does not fall in the region of rejec¬ 
tion below 40 or above 60 per cent. In this instance, the sample 
result does not deviate sufficiently from the hypothetical per¬ 
centage to cause its rejection. If the sample return had been 
62 Radicals and 38 Conservatives, however, the hypothesis of 
an equal division of opinion would have been rejected and it would 
have been concluded that the Radicals were in the majority. 
This argument and these conclusions are illustrated graphically 
in Fig. 102. 

From the figure it is seen that with a sample result of 57 per 
cent the hypothesis that $\mathbf{p}_1 = 0.5$ is accepted while with a sample 
result of 62 per cent the hypothesis that $\mathbf{p}_1 = 0.5$ is rejected. 
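The rule of rejection just described can be written out compactly. The Python sketch below is illustrative only; the function name and its parameters are assumptions, and the region of rejection is taken, as in the text, to lie beyond two standard deviations on either side of the hypothetical percentage.

```python
import math


def check_hypothesis(sample_count, sample_size, hypothetical_p1, n_sigma=2.0):
    """Return (lower limit, upper limit, rejected) for the two-sided rule."""
    sd = math.sqrt(hypothetical_p1 * (1.0 - hypothetical_p1) / sample_size)
    lower = hypothetical_p1 - n_sigma * sd
    upper = hypothetical_p1 + n_sigma * sd
    sample_p = sample_count / sample_size
    rejected = sample_p < lower or sample_p > upper
    return lower, upper, rejected


if __name__ == "__main__":
    for radicals in (57, 62):
        lower, upper, rejected = check_hypothesis(radicals, 100, 0.5)
        verdict = "rejected" if rejected else "not rejected"
        print(f"{radicals} Radicals out of 100: hypothesis {verdict} "
              f"(limits {lower:.2f} to {upper:.2f})")
```

With 57 Radicals the sample falls inside the limits of 40 and 60 per cent and the hypothesis stands; with 62 it falls outside and the hypothesis is rejected.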

Determining Confidence Limits for Population Percentage. 
Before confidence limits can be established for a population 
percentage it is first necessary to decide upon the degree of con¬ 
fidence that is to be placed in the computed limits. This is 
usually determined by so choosing the limits that the probability 
of their including the true percentage equals an agreed-upon 
figure, called the “confidence coefficient.” For example, if the 
confidence coefficient is set at 0.95, as is the common practice, 
then the limits will be so chosen that the probability of their 
embracing the true value is just 0.95. 

In the case of a normal sampling distribution, confidence 
limits with a confidence coefficient of 0.95 may be set up as 
follows: Choose as the upper confidence limit a value for the 
population percentage that, if it were the true value, would make the 
probability of getting the given sample value or a lower sample 
value just equal to 0.025. Since the sampling distribution is 
normal, this upper limit may be obtained by choosing $\mathbf{p}_1$ so that 
the sample value of 57 per cent falls at $-2\sigma$ from the mean value 
of the sample percentage, i.e., at $-2\sigma$ from $\mathbf{p}_1$. The mathematical 
equation becomes 

$$0.57 - \mathbf{p}_1 = -2\sqrt{\frac{\mathbf{p}_1 \mathbf{p}_2}{N}}$$

or, since $\mathbf{p}_2 = 1 - \mathbf{p}_1$, 

$$0.57 - \mathbf{p}_1 = -2\sqrt{\frac{\mathbf{p}_1 (1 - \mathbf{p}_1)}{N}}$$

When solved for $\mathbf{p}_1$, this becomes 

$$\mathbf{p}_1 = \frac{0.57 + \dfrac{2}{N} + 2\sqrt{\dfrac{(0.57)(0.43)}{N} + \dfrac{1}{N^2}}}{1 + \dfrac{4}{N}}$$

When $N$ is large, as it must be if the normal distribution is to be 
used as an approximation to the binomial distribution, the terms 
$2/N$, $4/N$, and $1/N^2$ can be dropped from the above equation 
without materially affecting the result. In this approximate 
form it becomes 

$$\mathbf{p}_1 = 0.57 + 2\sqrt{\frac{(0.57)(0.43)}{100}} = 0.67$$

In effect, this indicates that the upper confidence limit can be 
found approximately by adding to the sample percentage twice 
the standard deviation of the sampling distribution, computed 
with the sample percentage in place of the hypothetical popula¬ 
tion percentage. In general, if Pi is taken as the sample per¬ 
centage (note that sample statistics are printed in text type and 
the corresponding population parameters in boldface type), the 
upper confidence limit of the population percentage is given by 

^1 = Pi + 2 ^ (1) 


This is shown graphically in Fig. 103a. 

In a similar manner, the lower confidence limit is given approxi¬ 
mately by the formula 

A _ « 9 

* — Pi "" 2 ^ 


( 2 ) 
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For the given instance, in which Pi = 0.57, this lower limit is 

This is shown graphically in Fig. 103fc. 

How the upper limit is determined, how the lower limit is 
determined, and the resulting range or total interval between the 
confidence limits are pictured graphically in Figs. 103a, 6, and c. 



Fig. 103.—Diagram showing how the limits of the confidence interval are 

determined. 

The limits of the range are 0.47 to 0.67. This is known tech¬ 
nically as the ‘‘confidence intervaP^ and is shown in Fig. 103c. 
Owing to the manner in which the confidence limits were derived, 
it may be said that there is a probability of 0.95 that this con¬ 
fidence interval includes the true population percentage. By this 
is meant that, if confidence intervals were set up like this from 
many samples, 95 per cent of them would include the true 
population percentage. 
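
The calculation of the two limits can be retraced numerically. The following is only a sketch: the function names are illustrative and not part of the text, the multiplier 2 is the round figure used above for a confidence coefficient of 0.95, and the "exact" limits are simply the two roots of the quadratic equation that results when the sample percentage is set at 2 standard deviations from the hypothetical population percentage. 

```python
import math

def exact_limits(p_sample, n, z=2.0):
    """Roots of (p_sample - p)^2 = z^2 * p * (1 - p) / n, solved for p."""
    a = 1 + z**2 / n
    b = -(2 * p_sample + z**2 / n)
    c = p_sample**2
    root = math.sqrt(b**2 - 4 * a * c)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)

def approximate_limits(p_sample, n, z=2.0):
    """Sample percentage plus or minus z standard errors, as in Eqs. (1) and (2)."""
    se = math.sqrt(p_sample * (1 - p_sample) / n)
    return p_sample - z * se, p_sample + z * se

# Example from the text: sample percentage 0.57, N = 100
lower_exact, upper_exact = exact_limits(0.57, 100)
lower_approx, upper_approx = approximate_limits(0.57, 100)

print(f"exact limits:       {lower_exact:.3f} to {upper_exact:.3f}")
print(f"approximate limits: {lower_approx:.3f} to {upper_approx:.3f}")
```

For N = 100 the two pairs of limits differ only in the third decimal place, which is the sense in which the terms in 1/N may be dropped. 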

An Optimum Estimate of the Population Percentage. Up to 
this point in the argument, a particular hypothesis regarding the 
population has been tested and a method of setting up confidence 
intervals has been devised. A final problem of statistical 
inference is to indicate a method of making a single best estimate 
of the population percentage from the given sample. Various 
methods are employed, but the one that has received considerable 
prominence in recent years and that will be employed here 
is the method of maximum likelihood. 

Fig. 104.—Diagram showing relationship between probability of sample and 
likelihood of population percentage. (Horizontal axis: values of $\mathbf{p}_1$.) 

When a population percentage is given, the probabilities of 
various sample results may be determined from the sampling 
distribution of sample percentages, in this case, approximately 
from the normal frequency curve. The analysis here runs from 
a given population percentage to probabilities of various sample 
results. When a particular sample result is given, however, it 
is possible to determine the probabilities of obtaining this sample 
result from various hypothetical values for the population per¬ 
centage. Here the analysis runs from a given sample percentage 
to the probabilities of obtaining the particular sample from 
various hypothetical population percentages. In the latter 
analysis, the logarithm of the probability of the given sample 
result for a particular value of the population percentage is 
called the “likelihood” of the population percentage. 

As shown in Figs. 104a to 104c, these likelihoods will vary for 
different hypothetical values of the population percentage. The 
value of the population percentage that has the maximum likeli¬ 
hood is considered the best, or optimum, estimate of the popula¬ 
tion percentage; this is shown in Fig. 104d. Figures 104a to 
104c show graphically how the likelihoods of various population 
percentages (or, more exactly, their antilogs) vary with changes 
in the hypothetical values for these percentages. These various 
results are summarized in Fig. 104d, which, if completed for a 
large number of hypothetical values of the population percentage, 
would become a smooth curve showing the variation 
in the antilogs of the likelihoods of $\mathbf{p}_1$ with changes in $\mathbf{p}_1$. It is 
to be noted that the maximum point of this curve is also the 
point of maximum likelihood, since a logarithm is a maximum 
when its antilog is a maximum. 

Without undertaking the mathematical analysis involved,¹ 
it may be pointed out that the value of $\mathbf{p}_1$ which has the maximum 
likelihood is the value for which $\mathbf{p}_1 = p_1$. In other words, 
the maximum likelihood estimate of a population percentage is the 
percentage yielded by a given sample. This then becomes the 
best estimate of the population figure; that is to say, the sample 
percentage is the optimum, or best, estimate of the population 
percentage. 
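
The result can be checked by direct computation. The sketch below is an illustration only: it uses the exact binomial probability rather than the normal approximation of the text, it assumes the sample of 100 with 57 per cent used earlier in the chapter, and the function name and the grid of hypothetical percentages are arbitrary choices. 

```python
import math

def log_likelihood(p, successes, n):
    """Logarithm of the binomial probability of the observed sample for a hypothetical p."""
    return (math.log(math.comb(n, successes))
            + successes * math.log(p)
            + (n - successes) * math.log(1 - p))

n, successes = 100, 57                    # sample of 100 showing 57 per cent
grid = [i / 100 for i in range(1, 100)]   # hypothetical population percentages 0.01 to 0.99

# The hypothetical percentage with the greatest likelihood
best = max(grid, key=lambda p: log_likelihood(p, successes, n))
print(best)
```

Searching over the logarithm or over its antilog gives the same maximum, since a logarithm is a maximum when its antilog is a maximum. 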

¹ For such analysis, see ibid., pp. 208-209. 

SAMPLING OF MEANS AND VARIANCES 

Sampling Distribution of Means and Variances. The Mean. 
Most of the preceding analysis applying to sample percentages 
applies equally well to means of samples from a continuously 
distributed population. If the population is normal in form, it 
can readily be demonstrated that means of samples from such a 
population will form a frequency distribution which is also normal 
in form, the mean of which is the mean of the population and the 
variance of which is the variance of the population divided by the 
size of the sample. 

If the population is not normal, the sampling distribution of 
sample means nevertheless tends to be normal, with a mean 
equal to the mean of the population and a variance equal to the 
variance of the population divided by the size of the sample.¹ 

Accordingly, the equation for the standard deviation of the 
sampling distribution of sample means is as follows: 

$$\boldsymbol{\sigma}_{\bar{X}} = \frac{\boldsymbol{\sigma}}{\sqrt{N}} \qquad (3)$$

This is conventionally called the “standard error” of the mean.² 
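
That the standard deviation of sample means is close to the population standard deviation divided by the square root of the sample size, even for a population that is far from normal, can be seen by experiment. The sketch below rests on assumptions not in the text: an artificial skewed population of 20,000 values, samples of 50, and 2,000 repetitions. 

```python
import random
import statistics

random.seed(0)  # fixed seed so that the experiment is repeatable

# A hypothetical, strongly skewed (non-normal) population
population = [random.expovariate(1.0) for _ in range(20000)]
sigma = statistics.pstdev(population)

n = 50  # size of each sample
means = [statistics.fmean(random.sample(population, n)) for _ in range(2000)]

observed = statistics.pstdev(means)   # standard deviation of the sample means
predicted = sigma / n ** 0.5          # population standard deviation / sqrt(N)

print(f"observed  {observed:.4f}")
print(f"predicted {predicted:.4f}")
```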
The Variance. If samples are taken from a normal population, 
the sampling distribution of sample variances is not normal for 
small samples but approaches the normal form as the samples 
become larger, say larger than 30 cases. The mean of this 
normal distribution is the variance of the population, and the 
standard deviation of the sampling distribution is the variance 
of the population multiplied by $\sqrt{2/N}$. 

It is to be noted that, if the population is not normal, the 
sampling distribution of sample variances may not become 
normal, even for relatively large samples. Hence the use of 
the normal curve for making inferences about a population 
variance when the population is not normal may be an unwise 
procedure, even when the sample is large. 

But for variances of large samples taken from normal populations, 
the standard error of the variance is given by 

$$\boldsymbol{\sigma}_{\sigma^2} = \boldsymbol{\sigma}^2 \sqrt{\frac{2}{N}} \qquad (4)$$

¹ Ibid., p. 164. 

² Standard errors are printed in boldface type because they represent the 
standard deviations of the populations of all possible sample statistics of 
the type in question. Thus $\boldsymbol{\sigma}_{\bar{X}}$ is the standard deviation of all possible 
sample $\bar{X}$'s. 

The Standard Deviation. For standard deviations of large 
samples taken from normal populations, the standard error of 
the standard deviation is given by 

$$\boldsymbol{\sigma}_{\sigma} = \frac{\boldsymbol{\sigma}}{\sqrt{2N}} \qquad (5)$$


Inferences about Population Means and Variances. Since 
the sampling distribution of sample means tends to be normal 
in form and the same is true of the sampling distribution of 
variances and standard deviations, if the population is normal, 
it follows that the normal curve can be used to make inferences 
about the population values of these parameters from correspond¬ 
ing sample statistics. 

Testing a Hypothesis about the Population Mean. To illustrate 
how a hypothesis about a population mean may be tested, con¬ 
sider the following example. Suppose it is claimed that the 
mean length of life of a certain make of shoe (with constant wear) 
is 11.5 months. A random sample of 100 shoes is tested, and 
it is found that the average length of life of this sample is 10.8 
months. The standard deviation of the sample is 1.2 months. 
Do these sample results warrant the rejection of the claim of a 
true mean value of 11.5 months? 

To answer this question, proceed as follows: Let the risk of 
rejecting a hypothesis when it is true be set at 0.05. Then cal¬ 
culate the standard deviation of the sampling distribution of the 
mean (the “standard error” of the mean, as it is called) from 
Eq. (3). Since the standard deviation of the population is not 
known in this instance, the standard deviation of the sample 
must be used in its stead.¹ 

The value of $\boldsymbol{\sigma}_{\bar{X}}$ for the given problem is accordingly 

$$\boldsymbol{\sigma}_{\bar{X}} = \frac{1.2}{\sqrt{100}} = 0.12 \text{ month}$$


Next, calculate the difference between the hypothetical value 
of the mean and the sample value of the mean. This is 

10.8 - 11.5 = -0.7 month 

Finally, compare the size of this difference with the standard error 
of the mean. If the difference is numerically more than twice the 
standard error, the hypothesis will not be accepted. In the present 
instance 0.7 is over five times as great as 0.12; so the claim that 
the true mean is 11.5 is rejected. The sample mean deviates too greatly 
from the hypothetical mean for the latter to be accepted as 
reasonable. 

¹ This substitution does not materially affect the analysis when the 
sample is large. For further discussion, see ibid., pp. 273-284. 

Confidence Limits for the Population Mean. Confidence limits 
for the true mean with a confidence coefficient of 0.95 will be 
obtained by laying off $2\boldsymbol{\sigma}_{\bar{X}}$ plus and minus from the sample 
value. Thus, in the present problem these limits will be 
10.8 ± 2(0.12) = 11.04 and 10.56. Accordingly it can be said 
that there is a probability of 0.95 that the interval from 10.56 to 
11.04 includes the true population mean within its range. 
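
The arithmetic of the test and of the confidence limits for the shoe example can be retraced in a few lines. This is a sketch only; the variable names are illustrative, and the multiplier 2 is the round figure used in the text for a risk of 0.05. 

```python
import math

claimed_mean = 11.5   # hypothetical mean length of life, months
sample_mean = 10.8    # mean of the sample
sample_sd = 1.2       # standard deviation of the sample, used in place of the population value
n = 100               # number of shoes tested

standard_error = sample_sd / math.sqrt(n)   # standard error of the mean
difference = sample_mean - claimed_mean
ratio = difference / standard_error         # difference in units of the standard error
reject = abs(ratio) > 2                     # reject when beyond twice the standard error

lower = sample_mean - 2 * standard_error    # confidence limits, coefficient 0.95
upper = sample_mean + 2 * standard_error

print(f"standard error = {standard_error:.2f} month")
print(f"difference = {difference:.1f} month, ratio = {ratio:.2f}, reject = {reject}")
print(f"confidence interval: {lower:.2f} to {upper:.2f}")
```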

Optimum Estimate of the Population Mean. If the method of 
maximum likelihood is used to give the best estimate of the 
population mean, it is found that the sample mean is the maxi¬ 
mum likelihood estimate of the population mean. Hence, in 
the present instance the best estimate of the population mean 
is 10.8 months. 

Testing a Hypothesis about the Population Variance. The 
same analysis can be applied to inferences regarding population 
variances from sample variances. Suppose it is claimed that 
the true variability in the life of the given make of shoes is 
1.0 month. As in the case of the mean, this hypothesis may be 
tested by comparing the hypothetical value with the standard 
deviation of the sample of 100 shoes, which it will be assumed is 
1.2 months. 

The variances, or squares of the standard deviations, are 1.0 
and 1.44 square months, respectively. Their difference is 
1.44 - 1.00 = 0.44 square month. The standard deviation of 
the sampling distribution of sample variances, i.e., the standard 
error of the sample variance, is 

$$\boldsymbol{\sigma}_{\sigma^2} = 1.00\sqrt{\frac{2}{100}} = 0.14$$

Since the difference between the hypothetical value and the 
sample value is more than three times (0.44/0.14 = 3+) the 
standard error of the sample variance, the hypothesis must again 
be rejected.¹ 

¹ For more exact methods especially applicable to small samples, see ibid., 
pp. 284-287. 
If it were desired to test a hypothesis about the standard deviation, 
rather than about the variance, Eq. (5) would be used. In 
the present instance, the population standard deviation is hypothetically 
set at 1.0 month and the standard deviation in the 
sample is 1.2 months; the difference is 0.2 month. Using Eq. (5), 
the standard error of the standard deviation in this problem is 
found to be 

$$\boldsymbol{\sigma}_{\sigma} = \frac{1.00}{\sqrt{2(100)}} = 0.07$$


Since the difference is almost three times the standard error, the 
hypothesis is rejected as unreasonable. 

Confidence Limits for the Population Variance. Confidence 
limits for the population variance with a confidence coefficient of 
0.95 are given by 


$$\sigma^2 \pm 2\boldsymbol{\sigma}_{\sigma^2} = 1.44 \pm 2(0.14) = 1.72 \text{ and } 1.16$$

It can thus be said that there is a probability of 0.95 that the 
interval from 1.16 to 1.72 includes the true variance. The corresponding 
interval for the population standard deviation is from 
1.06 to 1.34, obtained by making use of Eq. (5). 
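
The figures for the variance and for the standard deviation can be reproduced as follows. The sketch is illustrative only and the variable names are not from the text. It follows the text in carrying over the standard errors computed from the hypothetical standard deviation of 1.0 month when the confidence limits are laid off. 

```python
import math

n = 100
hypothetical_sd = 1.0   # claimed population standard deviation, months
sample_sd = 1.2         # standard deviation of the sample, months

# Standard errors for large samples from a normal population
se_variance = hypothetical_sd**2 * math.sqrt(2 / n)   # variance times sqrt(2/N)
se_sd = hypothetical_sd / math.sqrt(2 * n)            # Eq. (5)

# Tests of the hypothesis: differences in units of the standard error
ratio_variance = (sample_sd**2 - hypothetical_sd**2) / se_variance
ratio_sd = (sample_sd - hypothetical_sd) / se_sd

# Confidence limits laid off from the sample values, as in the text
variance_limits = (sample_sd**2 - 2 * se_variance, sample_sd**2 + 2 * se_variance)
sd_limits = (sample_sd - 2 * se_sd, sample_sd + 2 * se_sd)

print(f"standard errors: {se_variance:.2f} (variance), {se_sd:.2f} (standard deviation)")
print(f"ratios: {ratio_variance:.2f}, {ratio_sd:.2f}")
print(f"variance limits: {variance_limits[0]:.2f} to {variance_limits[1]:.2f}")
print(f"standard deviation limits: {sd_limits[0]:.2f} to {sd_limits[1]:.2f}")
```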

Optimum Estimate of the Population Variance. Finally, the 
maximum likelihood estimate of the population variance is (for 
large samples) approximately the variance of the sample.¹ 
Hence the best estimate of the population variance in this instance 
is 1.44, which gives a population standard deviation of 1.2. 

¹ For small samples the multiplier $\frac{N}{N - 1}$ should be applied to the sample 
variance to give a better estimate of the population variance. Thus the 
optimum estimate, if N is small, say less than 30, is as follows: 

$$\text{estimate of } \boldsymbol{\sigma}^2 = \frac{N}{N - 1}\,\sigma^2$$

Cf. ibid., pp. 290-294. 

CONCLUSION 

From the few illustrations in this chapter, it should be clear 
that the normal curve is very useful in making inferences about 
populations from random samples. It can be used to measure 
sampling fluctuations in sample percentages, sample means, and, 
in certain instances, sample variances, as well as in a number of 
other statistics. It also has many uses in more advanced sampling 
analyses and is probably the most important sampling 
distribution that occurs in statistical theory. 

Table 23 contains not only the standard errors discussed in 
this chapter but also the standard errors for a number of other 
statistics. The method of applying these formulas to test 
hypotheses, to set up confidence intervals, or to obtain optimum 
estimates is similar for all statistics obtained from large samples. 

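Table 23.—Sampling Errors in Elementary Statistics for Which the 
Sampling Distribution Approximates the Normal Curve 
(Ordinarily these formulas for standard error cannot be used for N < 30) 

Statistics                                                              Standard errors 
$\bar{X}$                                                               $\boldsymbol{\sigma}_{\bar{X}} = \boldsymbol{\sigma}/\sqrt{N}$ 
$\sigma$                                                                $\boldsymbol{\sigma}_{\sigma} = \boldsymbol{\sigma}/\sqrt{2N}$ 
$Sk = \sqrt{\beta_1}(\beta_2 + 3) / [2(5\beta_2 - 6\beta_1 - 9)]$       $\boldsymbol{\sigma}_{Sk} = 1.225/\sqrt{N}$ * 
$\beta_2$                                                               $\boldsymbol{\sigma}_{\beta_2} = \sqrt{24/N}$ * 
$Mi$                                                                    $\boldsymbol{\sigma}_{Mi} = 1.25331\,\boldsymbol{\sigma}/\sqrt{N}$ * 
$A.D._{Mi}$                                                             $\boldsymbol{\sigma}_{A.D.} = 0.605\,\boldsymbol{\sigma}/\sqrt{N}$ 
$z$ ($r = \tanh z$)                                                     $\boldsymbol{\sigma}_{z} = 1/\sqrt{N - 3}$ 
$Q_1$, $Q_3$                                                            $\boldsymbol{\sigma}_{Q_1} = \boldsymbol{\sigma}_{Q_3} = 1.36263\,\boldsymbol{\sigma}/\sqrt{N}$ * 
$p_1$                                                                   $\boldsymbol{\sigma}_{p_1} = \sqrt{\mathbf{p}_1 \mathbf{p}_2 / N}$ 
$\sigma_{i.jk \ldots n}$                                                $\boldsymbol{\sigma}_{i.jk \ldots n}/\sqrt{2N}$ 
$b_{ij.k \ldots n}$                                                     $\boldsymbol{\sigma}_{i.jk \ldots n}/(\boldsymbol{\sigma}_{j.ik \ldots n}\sqrt{N})$ 

* Cf. Waugh, Albert E., Elements of Statistical Method (1938), pp. 142-147. 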















PART IV 


Study of Bivariates and Multivariates 

CHAPTER XIII 

SIMPLE CORRELATION 

CORRELATION FUNDAMENTAL TO KNOWLEDGE 

Progressive development in the methods of science and philos¬ 
ophy has been characterized by increase in the knowledge of 
relationships, or correlations. Nature has been found to be a 
multiplicity of interrelated forces. The phenomena of the 
physical world outside man seem to be well adapted to this 
concept of interrelationship. The same is true with respect 
to phenomena having to do with human beings and their 
environment. 

Progress in the Discovery of Correlation. In the physical 
sciences, where the laws of nature are, within certain limits, 
determinate, experimental method has sufficed to disclose innumerable 
relationships. Many of these physical correlations have 
become definitely known as “cause and effect relationships.” To 
some degree, too, this is true of biology, anthropology, geology, 
and the like. In these fields of study, great progress was made 
possible by the use of observation of “cases,” by tracing correlations 
previously known or suspected, and by laboratory 
experiments. In the social sciences, however, the establishment 
of certain knowledge, or knowledge of a high degree of probability 
regarding relationships, is a more difficult problem; and little 
scientific progress, comparatively speaking, has been made 
through the speculative method. This is particularly true so far 
as cause and effect relationships are concerned. 

For example, philosophical speculation, based upon qualitative 
or semiquantitative observation of experience, seemed to many 
economists of the eighteenth, nineteenth, and twentieth centuries 
to have codified the relationship between money and credit on the 
one hand and prices and many social problems on the other hand. 
But no such certainty among these social scientists now exists as 
to the nature of the cause and effect order of events. In its 
earlier conception, the principle of the quantity theory of money 
seemed to be one of extraordinary simplicity and determinate¬ 
ness; but the more it is studied in its quantitative aspects the 
more complicated it is found to be in reality. By the 1930’s 
and 1940’s, the world of scientific monetary theorists came to be 
characterized by confused controversy. The practical world still 
awaits their solution of the theoretical problem in order to make 
possible a world-wide solution of the problem of monetary reform. 
Some say that increases or decreases in the quantity of money 
cause rising and falling prices, respectively; but others, with con¬ 
vincing argument, maintain that rising prices cause an increase in 
the quantity of money, and vice versa. It is a moot question as 
to whether or not statistics can come to the rescue in the matter 
of deciding the direction of the cause and effect relationship; 
but at least the technique has been developed to disclose the 
facts of relationship more precisely than was ever before 
possible. 

By the latter half of the nineteenth century, in many fields of 
study, a point had been reached where speculation concerning 
relationships could advance no farther with the existing tech¬ 
niques. More exact measurement of relationship was needed. 
Many questions in biology, anthropology, and the social sciences 
generally awaited a scientific answer to the question: How can 
relationship be measured? Two interesting attempts were made 
by American scholars to devise a method of measuring relationship, 
one in 1877 and the other in 1892.¹ Credit for the discovery 
of a method, and for its subsequent mathematical development, 
however, belongs largely to the scholars of England. 

¹ Bowditch, H. P., “The Growth of Children,” Eighth Annual Report 
of the State Board of Health of Massachusetts (1877), pp. 275-324; Bryan, 
W. L., “On the Development of Voluntary Motor Ability,” American 
Journal of Psychology, Vol. 6 (1892), pp. 123-204. These are both described 
in Helen M. Walker, Studies in the History of Statistical Method (1929), pp. 
100-102, 109-110. 

Origin and Development of the Measurement of Correlation. In 
the nineteenth century pre-Darwinian and Darwinian doctrines of 
evolution were taking root, and the question of the influence of 
heredity vs. environment upon human characteristics was in a 
state of rarefied speculation and controversy. The experimental 
data appeared chaotic and amenable to as many interpretations 
as there were interpreters. 

One of the great nineteenth-century students of the problem of 
heredity was Sir Francis Galton. He had been profoundly 
impressed by Darwin’s Origin of Species (1859), concerning which 
he said,¹ “Its effect was to demolish a multitude of dogmatic 
barriers by a single stroke, and to arouse a spirit of rebellion 
against all ancient authorities whose positive and unauthenticated 
statements were contradicted by modern science.” Galton made 
numerous studies on the subject of heredity. The question that 
was motivating his studies was: How is it possible for a whole 
population to remain alike in its features, as a whole, during many 
successive generations, if the average produce of each couple 
resemble their parents? He attacked the question by studying 
sweet peas, moths, hounds, and finally the records of human 
families, which he obtained by offering prizes. 

Between the years 1877 and 1889, Galton worked out a mathe¬ 
matical method by which he could give an exact measure of the 
relationship between, for example, heights of children and the 
average heights of their parents. By statistical measurement he 
found that, if the stature of a group of parents is found to be, say, 
$y$ inches above or below the general average of the race, the average 
stature of their children will be only $\tfrac{2}{3}y$ inches above or below 
the average of the race; and he induced the law that the mean 
heights of offspring tend to “regress back toward the mean of the 
race” in spite of the strong hereditary influence of the parents. 
This is the famous law of regression to type, although the exact 
figure $\tfrac{2}{3}$ is not to be taken as final. 

¹ “Hereditary Talent and Character,” Macmillan’s Magazine, Vol. 12 
(May, 1865-October, 1865), pp. 157-166, 318-327; Hereditary Genius 
(1869, 2d ed. 1892); English Men of Science (1874); Human Faculty (1883); 
Record of Family Faculties (1884); Life History Album (1884); Natural 
Inheritance (1889). Cf. Walker, op. cit., pp. 102-103. 

The method Galton used was based upon the median and 
quartiles and has not been generally followed in subsequent work. 
In the 1890’s another method, based on the arithmetic mean 
and the standard deviation, was devised by Karl Pearson. His 
method has been widely adopted and is known as the “Pearsonian 
coefficient of correlation.”¹ 

It should be pointed out that in the fields of meteorology and 
astronomy mathematicians had previously worked out a formula 
for a joint or bivariate normal frequency distribution. This 
gave the probability of the simultaneous occurrence of two errors 
of observation but did not directly indicate a measure of correla¬ 
tion between them. Work in this field was more concerned with 
the simultaneous occurrence of independent errors than of 
correlated errors.² Galton, as already indicated, was primarily 
concerned with the problem of correlation, and it remained for 
Karl Pearson and others to combine the work of Galton and the 
work of the mathematicians into a unified theory of correlation. 
Pearson’s development of the theory of correlation will be 
explained on pages 338 to 349. 

Applications of the Method by Social Scientists. As early as 
1901, R. H. Hooker, using the Pearsonian coefficient, studied 
correlation between marriage rates and trade. He correlated 
marriage rates with per capita exports of England, with per 
capita imports, and with other trade events.³ In 1906, G. 
Udny Yule likewise made a study of correlation between marriage 
rates and trade. He also correlated trade activity with 
birth rates and death rates but found little correlation between 
them.⁴ 

¹ Cf. Walker, op. cit., pp. 110-115; Pearson, Karl, “Notes on the 
History of Correlation,” Biometrika, Vol. 13 (1920-1921), pp. 25-45, where 
he cites W. F. R. Weldon, “Variations Occurring in Certain Decapod 
Crustacea— I. Crangon vulgaris,” Proceedings of the Royal Society of London, 
Vol. 47 (1890), pp. 445-453; Weldon, W. F. R., “Certain Correlated Variations 
in Crangon vulgaris,” Proceedings of the Royal Society of London, Vol. 51 
(1892), pp. 2-21; Yule, G. U., “On the Theory of Correlation,” Journal of 
the Royal Statistical Society, Vol. 60 (1897), pp. 812-850. 

² Pretorius, S. J., “Skew Bivariate Frequency Surface, Examined in the 
Light of Numerical Illustrations,” Biometrika, Vol. 22 (1930-1931), pp. 
109-223; Pearson, Karl, “The Contribution of Giovanni Plana to the 
Normal Bivariate Frequency Surface,” Biometrika, Vol. 20A (1928), pp. 
295-298; Walker, Helen M., “The Relation of Plana and Bravais to the 
Theory of Correlation,” Isis, Vol. 10, No. 34 (1938), pp. 466-484. 

³ “Correlation of the Marriage-rate with Trade,” Journal of the Royal 
Statistical Society, Vol. 64 (September, 1901), pp. 485-492. 

⁴ Yule, G. Udny, “On Changes in the Marriage- and Birth-rates in 
England and Wales, Etc.,” Journal of the Royal Statistical Society, Vol. 69 
(1906), pp. 88-132; “The Applications of the Method of Correlation to Social 
and Economic Statistics,” Journal of the Royal Statistical Society, Vol. 72 
(1909), pp. 721-730. 

The entire science of biometrics has been built up by the 
development of correlation methods; Karl Pearson is one of the 
founders of Biometrika, the scientific organ in that field of study. 
Correlation measurement has been intensively applied in psychological 
and educational research.¹ In recent years, the 
correlation method has played an important role in the analysis 
of economic problems and in economic theory, a trend particularly 
evident in the field of agricultural economics. 

¹ Rugg, Harold O., Statistical Methods Applied to Education, 

THE BIVARIATE FREQUENCY DISTRIBUTION 

The statistical basis for the study of correlation is the bivariate 
or multivariate frequency distribution. In the univariate 
frequency distributions studied in the previous chapters, the 
data were classified according to a single characteristic. In 
bivariate or multivariate distributions, data are classified according 
to two or more characteristics. This chapter will be concerned 
with the analysis of bivariate distributions. Chapter 
XVI will deal with multivariate distributions. 

An Illustration of a Bivariate Distribution. Table 24 shows 
the distribution of grades of 81 freshmen in a second-semester 
English course at Mount Holyoke. For each of these 81 students 
there is also available the grade in first-semester English. Hence 
they may be cross-classified according to both their first- and 
second-semester grades. This has been done in Table 25. 

Table 24.— Grades of 81 Mount Holyoke Freshmen in a Second-semester 
English Course 

Grades     Frequency 
$X_1$      $F$ 
60-        1 
80-        0 
100-       3 
120-       0 
140-       2 
160-       9 
180-       8 
200-       16 
220-       17 
240-       13 
260-       9 
280-       1 
300-       2 
Total      81 

Table 25.—Bivariate Frequency Distribution of 81 Mount Holyoke 
Freshmen According to Their Grades in First- ($X_2$) and Second- 
($X_1$) Semester English 



The bivariate frequency distribution represented by Table 25 
gives more complete information than is contained in the uni¬ 
variate frequency distribution of Table 24. Of the 8 students 
having second-semester grades between 180 and 200, the seventh 
row of Table 25 shows that 2 had first-semester grades between 
140 and 160, 4 had first-semester grades between 160 and 180, 
and 2 had first-semester grades between 180 and 200. This is a 
small univariate frequency distribution of the group of students 
who had grades between 180 and 200 in their second-semester 
course. In Table 25 there are 11 rows and 11 columns each of 
which contains a univariate frequency distribution. Since there 
are 11 subgroups of 11 groups, there are altogether 121 classes, 
represented in the table by 121 squares, or cells, of which 28 cells 
contain frequencies. 

The totals of the columns of Table 25 give the univariate 
frequency distribution of all the students classified according to 
their first-semester English grades. The totals of the rows give 
the univariate frequency distribution of all the students classified 
according to their second-semester English grades. 

For each of the columns an arithmetic mean could be calcu¬ 
lated and the question could be answered: Did girls who earned 
high grades in their first-semester English average higher grades 
in second-semester English than did the girls who attained only 
low grades in their first-semester English? An arithmetic mean 
could similarly be calculated for each of the row frequency 
distributions. For all the 11 column frequency distributions 
and all the 11 row frequency distributions the standard deviations 
also could be calculated. In other words, in this bivariate 
frequency distribution there are 22 univariate frequency dis¬ 
tributions in addition to the 2 univariate frequency distributions 
represented by the totals for the respective variables. Each of 
these 22 frequency distributions might be analyzed in the same 
way as any frequency distribution. 
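
As an illustration, the small row distribution described above (second-semester grades between 180 and 200) and the distribution of row totals in Table 24 can each be summarized by a mean and a standard deviation. The sketch assumes, as is usual with grouped data, that every case in a class falls at the class mid-point; the function name is illustrative only. 

```python
import math

def summarize(distribution):
    """Return (N, mean, standard deviation) for {class mid-point: frequency}."""
    n = sum(distribution.values())
    mean = sum(x * f for x, f in distribution.items()) / n
    variance = sum(f * (x - mean) ** 2 for x, f in distribution.items()) / n
    return n, mean, math.sqrt(variance)

# Row of Table 25 for second-semester grades 180-200:
# first-semester classes 140-160, 160-180, 180-200 with frequencies 2, 4, 2
row_180 = {150: 2, 170: 4, 190: 2}

# Totals of the rows (Table 24), classes 60- through 300- at their mid-points
row_totals = {70: 1, 90: 0, 110: 3, 130: 0, 150: 2, 170: 9, 190: 8,
              210: 16, 230: 17, 250: 13, 270: 9, 290: 1, 310: 2}

n_row, mean_row, sd_row = summarize(row_180)
n_all, mean_all, sd_all = summarize(row_totals)

print(f"row 180-200: N = {n_row}, mean = {mean_row:.2f}, s.d. = {sd_row:.2f}")
print(f"all students: N = {n_all}, mean = {mean_all:.2f}, s.d. = {sd_all:.2f}")
```

The same function applied to each row and each column in turn would yield the means and standard deviations of all 22 univariate distributions. 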

METHODS OF SUMMARIZATION AND COMPARISON IN 
BIVARIATE DISTRIBUTIONS 

The characteristics of a bivariate frequency distribution can 
be described by various statistics. Many of these are the same 
as the statistics employed in the description of a univariate 
frequency distribution, but some are new. Thus, the central 
tendency of one of the two variables may be measured by its 
mean, its mode, or its median. The dispersion of this variable 
may be measured by its range, standard deviation, average 
deviation, or quartile deviation; and its skewness and kurtosis 
may be measured by $\beta_1$ and $\beta_2$, respectively. The same is true 
for the other variable and for the numerous univariate frequency 
distributions that make up the details of a single bivariate 
distribution, as explained in the preceding paragraph. New 
statistics are required, however, to describe the tendency of the 
variables to vary in unison. A bivariate frequency distribution 
thus presents the new problem of measuring correlation and the 
discovery of statistics for measuring it. 

Progressions of Means. If the data are grouped in the form 
of a bivariate scatter diagram such as Table 25, one way to 
measure the association between the two variables is to compute 
the mean values of one variable for various values of the other 
variable. In Table 25, for example, the means of the columns 
would show how the $X_1$ variable tends to change on the average 
with changes in $X_2$, and the means of the rows show how the $X_2$ 
variable tends to change on the average with changes in $X_1$. 
The values of these column and row means are given in Table 26 
and graphed in Figs. 105 and 106. 

The nature of the association between the variables is evident 
from these graphs. Consider, for example, the progression of 
the means of $X_1$ shown in Table 26 and Fig. 105. These show 
that the mean value of $X_1$ tends to increase with increases in $X_2$. 
Thus, when $X_2$ is between 100 and 120, the mean value of $X_1$ is 
110; when $X_2$ is between 200 and 220, the mean value of $X_1$ is 
222.31; and when $X_2$ is between 260 and 280, the mean value of $X_1$ 
is 266.0. Although the increase in the average value of $X_1$ with a 
given increase in $X_2$ does not appear to be uniform, the progression 
of the means of $X_1$ with a change in $X_2$ does appear to follow 
a straight line. The same can be said of the progression of the 
means of $X_2$ with changes in $X_1$. 

Lines of Regression. The tendency of the progressions of 
means to follow straight lines suggests the following hypothesis: 
Consider first the progression of the means of $X_1$ with changes in 
$X_2$. Suppose that $X_1$ is related to $X_2$ in such a way that an 
increase in $X_2$ of one unit always produces an increase in $X_1$ of, 
say, $b$ units, $b$ being a constant. If $X_2$ were the only factor affecting 
$X_1$, all the values of $X_1$, when plotted, would fall exactly on a 
straight line and the progression of all means would be perfectly 
linear. If there were other forces affecting $X_1$, however, causing 
it to be higher or lower than the value expected from its association 
with $X_2$, then the actual values would not fall on a straight 
line but would be scattered about that line. If this view of the 
variation between $X_1$ and $X_2$ is adopted, a straight line fitted to 
the data should give the law of relationship between $X_1$ and $X_2$ 
and the scatter about it should give the deviation from this line 
caused by the other factors affecting $X_1$. 

Table 26.— Means of Rows and Means of Columns 
From correlation table showing the relationship between second- and first- 
semester grades of 81 Mount Holyoke freshmen 

Left-hand columns: progression of means of second-semester English grade 
($X_1$) with successive values of first-semester English grade ($X_2$); 
regression of $X_1$ on $X_2$ (vertical frequency distributions in Table 25). 

Right-hand columns: progression of means of first-semester English grade 
($X_2$) with successive values of second-semester English grade ($X_1$); 
regression of $X_2$ on $X_1$ (horizontal frequency distributions in Table 25). 

Values of $X_2$   Means of $X_1$, $\bar{X}_c$   Values of $X_1$   Means of $X_2$, $\bar{X}_r$ 
60-               110.00                        60-               130.00 
80-                                             80- 
100-              110.00                        100-              83.33 
120-              152.86                        120- 
140-              178.00                        140-              160.00 
160-              195.00                        160-              141.11 
180-              203.33                        180-              170.00 
200-              222.31                        200-              200.00 
220-              241.11                        220-              225.29 
240-              250.00                        240-              234.62 
260-              266.00                        260-              256.67 
280-              310.00                        280-              230.00 
                                                300-              290.00 

A similar view could be taken of the variation in the mean 
value of $X_2$ with changes in $X_1$ and would justify drawing a 
straight line to show the law of relationship between $X_2$ and $X_1$. 
The lines that are derived to show the relationship between the 
mean value of one variable and changes in value of another are 
called “lines of regression,” following Galton, who used this term 
in his original study of the relationship between the heights of 
children and the heights of their parents. 

The Line of Regression of X1 on X2. Suppose the above hypothesis
is adopted, namely, that X1 is linearly related to X2 and
that deviations from this relationship are the result of forces
independent of X2. The statistical problem then becomes how to
draw the line that is supposed to show this linear relationship.

One of the simplest ways of finding the line of regression of X1
on X2 is to plot the progression of the means of X1 for various
values of X2 and to draw a line freehand through the means so that
it seems to fit the progression of means. The great difficulty with
this method is that it involves considerable personal discretion
and that no two persons will necessarily draw the same line.

An impersonal method of fitting a line to a given set of data is
the so-called "method of least squares." This fits a line to the
data so that the sum of the deviations of the dependent variable
from the line is zero and the sum of the squares of the deviations is
a minimum (hence the name "method of least squares").
Mathematically, the first of these two conditions follows from
the second, so that there is really only one condition, viz., that of
least squares.

The use of the method of least squares to fit lines to a set of 
data goes back to the beginning of the nineteenth century. It 
first came into prominence in 1806, when Adrien Marie Legendre 
(1752-1833) published a book on new methods of determining 
orbits of comets. After the publication of this book, Karl 
Friedrich Gauss (1777-1855), a German mathematician, claimed 
that he had been applying this principle since 1795. 

Later it was shown that, if the method is used to fit a line to a 
sample set of data, then, under particular circumstances, the line 
so determined is the best, or optimum, estimate of the population 
line. For example, if data are available as to the orbit of a comet 
or planet and if a line or curve is fitted to these data by the method 
of least squares, then the line or curve so obtained would be the 
most probable estimate of the true orbit. ^ 

The line of regression of X1 on X2 may be derived by the method
of least squares as follows: Consider the point P, Fig. 107. This,
according to hypothesis, would fall at P' if there were no forces
associated with X1 other than X2. Supposedly, however, there
are other forces that are independent of X2 and make X1 smaller
than this average value so as to cause the point to be located
actually at P. Since these other forces affect only X1, the point
is deflected in a vertical direction. The line of regression of X1 on
X2 is therefore to be obtained by minimizing the vertical deviations
from the line.

^ See Smith and Duncan, Sampling Statistics, pp. 372-375.

Let the equation of this line of regression of X1 on X2 be

$$ X_1' = a_{1.2} + b_{12} X_2 $$

The deviations would then be

$$ d_{1.2} = X_1 - X_1' = X_1 - a_{1.2} - b_{12} X_2 $$

and the problem is to determine a1.2 and b12 so as to minimize the
sum of the squares of deviations like X1 - X1' shown in Fig. 107, i.e.
(X1 - X1' = x1 - x1' because x1 = X1 - X̄1 and x1' = X1' - X̄1),

$$ \sum (X_1 - X_1')^2 = \sum (X_1 - a_{1.2} - b_{12} X_2)^2 = \text{minimum} $$

Fig. 107. Diagram illustrating the fitting of the line of regression of X1 on X2
by the method of least squares (vertical deviations minimized).

According to the differential calculus, the conditions for
minimizing Σ(X1 - a1.2 - b12X2)² are that the partial derivative
with respect to a1.2 and the partial derivative with respect to b12
should both be zero. These conditions are

$$ \frac{\partial \sum (X_1 - a_{1.2} - b_{12} X_2)^2}{\partial a_{1.2}} = -2 \sum (X_1 - a_{1.2} - b_{12} X_2) = -2 \sum d_{1.2} = 0 \tag{1} $$

$$ \frac{\partial \sum (X_1 - a_{1.2} - b_{12} X_2)^2}{\partial b_{12}} = -2 \sum (X_1 - a_{1.2} - b_{12} X_2) X_2 = -2 \sum d_{1.2} X_2 = 0 \tag{2} $$

If the parentheses are removed, these equations may be written

$$ N a_{1.2} + b_{12} \sum X_2 = \sum X_1 $$

$$ a_{1.2} \sum X_2 + b_{12} \sum X_2^2 = \sum X_1 X_2 $$

(Σa1.2 = Na1.2 because a1.2 is a constant.)

The first gives a1.2 in terms of b12 as follows:

$$ a_{1.2} = \bar{X}_1 - b_{12} \bar{X}_2 \tag{3} $$

(ΣX1/N = X̄1, and ΣX2/N = X̄2.)


If this is substituted in the second, the value of b12 is found to be

$$ b_{12} = \frac{\sum X_1 X_2 - N \bar{X}_1 \bar{X}_2}{\sum X_2^2 - N \bar{X}_2^2} \tag{4} $$


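Equations (3) and (4) thus give the values of a1.2 and b12 in terms
of the sample values of X1 and X2. If these values are grouped
into class intervals and deviations are measured from an arbitrary
origin, the last equation may be put in the form

$$ b_{12} = \frac{i_1}{i_2} \cdot \frac{\sum F d_1 d_2 - N c_1 c_2}{\sum F d_2^2 - N c_2^2} \tag{5} $$

where

$$ c_1 = \frac{\sum F d_1}{N} \quad \text{and} \quad c_2 = \frac{\sum F d_2}{N} $$

Here d1 and d2 are the deviations of the class mid-points from the
arbitrary origins, expressed in class-interval units, F is the
frequency of each pair of classes, and i1 and i2 are the widths of
the class intervals of X1 and X2.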


If deviations are measured from the means of X1 and X2, then

$$ a_{1.2} = 0 $$

$$ b_{12} = \frac{\sum x_1 x_2}{\sum x_2^2} \tag{6} $$
In the next chapter, in which the work of measuring correlation
is illustrated by numerical calculations, it is found that for the
bivariate frequency distribution of Table 25 (N = 81, deviations
d1 and d2 measured from arbitrary origins in class-interval units),
ΣFd2² = 493 and

$$ b_{12} = \frac{\sum F d_1 d_2 - 81 \left(\frac{111}{81}\right)\left(\frac{57}{81}\right)}{493 - 81\,\frac{(57)^2}{(81)^2}} = 0.8322 $$

For these same data, X̄1 = 217.4 and X̄2 = 204.1, so that
a1.2 = 217.4 - 0.832(204.1) = 47.58. The line of regression of
X1 on X2 is thus X1' = 47.58 + 0.8322X2.

Fig. 108. Diagram illustrating the fitting of the line of regression of X2 on X1
by the method of least squares (horizontal deviations minimized).
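As a check on Eqs. (3) and (4), the following minimal Python sketch fits the line of regression of X1 on X2 to the ten pairs of export and import figures given in Table 28 of this chapter (the grouped data of Table 25 are not reproduced here). The function name fit_line and the variable names are illustrative assumptions, not part of the text. The sketch also confirms that the fitted deviations satisfy the two least-squares conditions (1) and (2).

```python
# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]


def fit_line(dep, ind):
    """Return (a, b) of the least-squares line dep' = a + b * ind."""
    n = len(dep)
    mean_dep = sum(dep) / n
    mean_ind = sum(ind) / n
    sum_cross = sum(d * i for d, i in zip(dep, ind))
    sum_sq_ind = sum(i * i for i in ind)
    # Eq. (4): slope from raw sums and means.
    b = (sum_cross - n * mean_dep * mean_ind) / (sum_sq_ind - n * mean_ind ** 2)
    # Eq. (3): intercept from the two means.
    a = mean_dep - b * mean_ind
    return a, b


a12, b12 = fit_line(X1, X2)

# Vertical deviations d = X1 - X1' from the fitted line.
d = [x1 - (a12 + b12 * x2) for x1, x2 in zip(X1, X2)]
sum_d = sum(d)                                      # condition (1)
sum_d_x2 = sum(di * x2 for di, x2 in zip(d, X2))    # condition (2)

print(f"a1.2 = {a12:.4f}, b12 = {b12:.4f}")
print(f"sum of d = {sum_d:.2e}, sum of d*X2 = {sum_d_x2:.2e}")
```

Both sums of deviations come out as zero to within rounding error, which is what conditions (1) and (2) require of any line fitted by least squares.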
Line of Regression of X2 on X1. The line of regression of X2 on
X1 may also be obtained by either freehand or mathematical
methods. A freehand line could be obtained by drawing a line
through the progression of the means of X2 on X1. A mathematical
line could be obtained by the method of least squares.

The preceding section determined a mathematical formula for
the line of regression of X1 on X2 by minimizing the sum of the
squares of the vertical deviations. Now X2 is assumed to be the
dependent variable, and the line of regression of X2 on X1 is
determined by minimizing the sum of the squares of the horizontal
deviations (see Fig. 108). Except for this difference, the process
is precisely the same as that described for fitting the other line
and will not be repeated here. If the line of regression of X2 on
X1 is represented by the equation

$$ X_2' = a_{2.1} + b_{21} X_1 $$

then minimizing Σ(X2 - X2')² = Σ(X2 - a2.1 - b21X1)² gives the
following values for a2.1 and b21:

$$ a_{2.1} = \bar{X}_2 - b_{21} \bar{X}_1 \tag{7} $$

$$ b_{21} = \frac{\sum X_1 X_2 - N \bar{X}_1 \bar{X}_2}{\sum X_1^2 - N \bar{X}_1^2} \tag{8} $$

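or, for data grouped into class intervals with deviations measured
from an arbitrary origin,

$$ b_{21} = \frac{i_2}{i_1} \cdot \frac{\sum F d_1 d_2 - N c_1 c_2}{\sum F d_1^2 - N c_1^2} \tag{9} $$

where d1, d2, c1, c2, i1, and i2 have the same meaning as in Eq. (5).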



If deviations are measured from the means of X1 and X2, then

$$ a_{2.1} = 0 $$

$$ b_{21} = \frac{\sum x_1 x_2}{\sum x_1^2} \tag{10} $$


For the data of Table 25 the line of regression of X2 on X1 is thus
found to be

$$ X_2' = -5.548 + 0.9642 X_1 $$

This is shown in Fig. 108.
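That the two lines of regression are distinct lines, and not one line read in two directions, can be seen in a small computation. The sketch below is illustrative only: it uses the export and import figures of Table 28 rather than the grouped data of Table 25, and the helper name regression is an assumption. It fits both lines from deviations about the means, as in Eqs. (6) and (10), and then compares their slopes when both are drawn with X1 as the ordinate.

```python
# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]


def regression(dep, ind):
    """Least-squares line of dep on ind: returns (a, b) with dep' = a + b * ind."""
    n = len(dep)
    mean_dep = sum(dep) / n
    mean_ind = sum(ind) / n
    # Slope from deviations about the means, as in Eqs. (6) and (10).
    num = sum((d - mean_dep) * (i - mean_ind) for d, i in zip(dep, ind))
    den = sum((i - mean_ind) ** 2 for i in ind)
    b = num / den
    return mean_dep - b * mean_ind, b


a12, b12 = regression(X1, X2)   # X1 on X2: vertical deviations minimized
a21, b21 = regression(X2, X1)   # X2 on X1: horizontal deviations minimized

mean1 = sum(X1) / len(X1)
mean2 = sum(X2) / len(X2)

# Slope of the second line when it is redrawn with X1 as the ordinate.
slope_21_as_x1_on_x2 = 1 / b21

print(f"X1' = {a12:.4f} + {b12:.4f} X2")
print(f"X2' = {a21:.4f} + {b21:.4f} X1")
print(f"slopes against the X2-axis: {b12:.4f} and {slope_21_as_x1_on_x2:.4f}")
```

Both lines pass through the point of means, but the line of X2 on X1 is the steeper of the two when referred to the X2-axis; the two would coincide only if every point lay exactly on one straight line.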
Interpretation of a Line of Regression. A line of regression
of one variable on another is to be interpreted as indicating the
values of the first (the dependent variable) that would be
obtained for various values of the second (the independent
variable) if no other forces were affecting the dependent variable.
If knowledge of the independent variable is all that is to be had,
then the line of regression gives the best estimate that may be
made of the dependent variable.

The regression statistic a (that is, a1.2 or a2.1) gives the value
of the dependent variable when the independent variable is zero.
It is of only arbitrary significance, since its value is affected by
the origin selected for measuring the independent variable as
well as the units of measurement. The regression statistic b
(that is, b12 or b21) is independent of the origin selected and
indicates the change that would occur in the dependent variable per
unit change in the independent variable. In the line of regression
of X1 on X2, for example, when X2 increases by one unit, X1
increases or decreases by b12 units depending on the sign of b12.
The value of b12 will not be affected by proportional changes in
the units of X1 and X2. Similar statements hold for b21 in the
case of the line of regression of X2 on X1.
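These statements about a and b can be verified numerically. The sketch below is a minimal illustration, again on the export and import figures of Table 28; the helper name regression and the particular changes of origin and unit are assumptions chosen for the example. It refits the line of X1 on X2 after shifting the origin of X2, after a proportional change in the units of both variables, and after a change in the unit of X2 alone.

```python
import math

# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]


def regression(dep, ind):
    """Least-squares line of dep on ind: returns (a, b) with dep' = a + b * ind."""
    n = len(dep)
    mean_dep = sum(dep) / n
    mean_ind = sum(ind) / n
    num = sum((d - mean_dep) * (i - mean_ind) for d, i in zip(dep, ind))
    den = sum((i - mean_ind) ** 2 for i in ind)
    b = num / den
    return mean_dep - b * mean_ind, b


a, b = regression(X1, X2)

# Measure X2 from an origin of 1.0 billion: b is unchanged, a moves by b * 1.0.
a_shift, b_shift = regression(X1, [x - 1.0 for x in X2])

# Express both variables in millions of dollars (a proportional change of units).
a_both, b_both = regression([1000 * x for x in X1], [1000 * x for x in X2])

# Express only X2 in millions: b now refers to one million dollars of imports.
a_x2, b_x2 = regression(X1, [1000 * x for x in X2])

print(f"original:          a = {a:.4f}, b = {b:.4f}")
print(f"origin shifted:    a = {a_shift:.4f}, b = {b_shift:.4f}")
print(f"both in millions:  a = {a_both:.1f}, b = {b_both:.4f}")
print(f"X2 only, millions: a = {a_x2:.4f}, b = {b_x2:.7f}")
```

Only the last case alters b, because there the units of the two variables are no longer changed in the same proportion.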

Standard Deviation about Means or Line of Regression. If the
progressions of the means or the lines of regression are used to
measure the average relationship between two variables, some
additional measure is desirable to determine the degree of
representativeness of these measures. In the case of a monovariate
distribution, it will be recalled, the representativeness of the
mean depended upon how closely the cases were scattered around
this mean value. This dispersion was measured by the standard
deviation or some other measure. Similarly, in the present
instance, the representativeness of the means of X1, say, for
various values of X2 will be shown by the dispersion of the cases
around each mean. The standard deviation of the cases in each
column around the mean of that column may thus be taken to
show how well the mean represents the cases in the column.
The same is true for any row.

In Table 27 are given the standard deviations of the columns
of Table 25. The zero values refer to the columns in which
there is only one case or in which all the cases fall in the same
class interval. The other values center around 16, their
average being 16.9.

It is to be noted that the average standard deviation from the
means of the columns, as well as the individual standard
deviations from which this average is calculated, are considerably
less than the total standard deviation of X1, namely, σ1 = 43.9.
The column means are thus much more representative of the
column values of X1 than the grand mean is of all the X1's.


Table 27. Standard Deviations for Columns of Table 25

    Column    Nc    ΣFxc²       ΣFxc²/Nc    σc
    (1)        2        0           0        0
    (2)        0
    (3)        1        0           0        0
    (4)        7    8,342.86    1,191.8     34.5
    (5)        5      480.00       96.0      9.8
    (6)        8    1,400.00      175.0     13.2
    (7)        9    4,800.00      533.3     23.1
    (8)       13    2,830.77      217.8     14.8
    (9)       18    6,577.78      365.4     19.1
    (10)      11    3,200.00      290.9     17.1
    (11)       5      320.00       64.0      8.0
    (12)       2        0           0        0
    Total     81

This may be explained by the fact that much of the total
variation in X1 is due to the variation from column to column, a
variation that is presumably due to association with X2. When
this variation is eliminated, the remaining variation is
considerably reduced. A similar analysis would show the same results
with respect to variation around the means of the rows.
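The figures of Table 27 can be recomputed from the column frequencies and sums of squares alone. The sketch below is a minimal check in Python; it assumes that the average of 16.9 quoted above is an average of the column standard deviations weighted by the number of cases in each column and taken over all 81 cases, an assumption that does reproduce the printed figure. The variable names are illustrative.

```python
import math

# (Nc, sum of squared deviations about the column mean) for columns (1)-(12)
# of Table 27; columns with no spread are entered with a zero sum of squares.
columns = [
    (2, 0.0), (0, 0.0), (1, 0.0), (7, 8342.86), (5, 480.00), (8, 1400.00),
    (9, 4800.00), (13, 2830.77), (18, 6577.78), (11, 3200.00), (5, 320.00),
    (2, 0.0),
]

N = sum(n for n, _ in columns)

# Standard deviation of each column about its own mean (zero for empty columns).
sigmas = [math.sqrt(ss / n) if n > 0 else 0.0 for n, ss in columns]

# Average of the column standard deviations, weighted by the cases in each column.
weighted_avg = sum(n * s for (n, _), s in zip(columns, sigmas)) / N

total_sigma = 43.9  # total standard deviation of X1, as given in the text

print([round(s, 1) for s in sigmas])
print(f"N = {N}, weighted average = {weighted_avg:.1f}")
print(f"ratio to total standard deviation = {weighted_avg / total_sigma:.2f}")
```

On these figures the average scatter about the column means is well under half of the total scatter of X1.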

If the association between X1 and X2 is measured by a straight
line, the representativeness of this line may be measured by the
dispersion of cases around it. Such a measure would be the
standard deviation of the deviations from the line. The standard
deviation of the vertical deviations from the line of regression
of X1 on X2 will measure the representativeness of that line, and
the standard deviation of the horizontal deviations from the line
of regression of X2 on X1 will measure its representativeness.
In either case, σ² equals the sum of the squared deviations from
the line divided by N. If the line is fitted by the method of
least squares, the sum of the squared deviations from the line
may be computed from the equations^

$$ N\sigma_{1.2}^2 = \sum d_{1.2}^2 = \sum X_1^2 - a_{1.2} \sum X_1 - b_{12} \sum X_1 X_2 $$

and

$$ N\sigma_{2.1}^2 = \sum d_{2.1}^2 = \sum X_2^2 - a_{2.1} \sum X_2 - b_{21} \sum X_2 X_1 $$

or

$$ \sigma_{1.2}^2 = \frac{\sum d_{1.2}^2}{N} \tag{11} $$

and

$$ \sigma_{2.1}^2 = \frac{\sum d_{2.1}^2}{N} \tag{12} $$

These standard deviations from the lines of regression will never
exceed the total standard deviation and will be less than it
whenever the line has any slope, because the variation
represented by the line of regression has been eliminated by
taking deviations from the line.
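The short-cut expression for the sum of squared deviations can be compared with the direct computation. The following Python sketch is illustrative only: it uses the export and import figures of Table 28 of this chapter, and its variable names are assumptions. It fits the line of X1 on X2, computes Σd1.2² both directly and from raw sums, and compares the resulting standard deviation about the line with the total standard deviation of X1.

```python
import math

# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]
n = len(X1)
mean1 = sum(X1) / n
mean2 = sum(X2) / n

# Least-squares line of X1 on X2.
b12 = (sum((p - mean1) * (q - mean2) for p, q in zip(X1, X2))
       / sum((q - mean2) ** 2 for q in X2))
a12 = mean1 - b12 * mean2

# Direct route: square and sum the vertical deviations from the line.
sum_d_sq = sum((p - (a12 + b12 * q)) ** 2 for p, q in zip(X1, X2))

# Short-cut route: only raw sums are needed.
shortcut = (sum(p * p for p in X1)
            - a12 * sum(X1)
            - b12 * sum(p * q for p, q in zip(X1, X2)))

sigma_1_2 = math.sqrt(sum_d_sq / n)                             # about the line
sigma_1 = math.sqrt(sum((p - mean1) ** 2 for p in X1) / n)      # total

print(f"direct = {sum_d_sq:.4f}, short cut = {shortcut:.4f}")
print(f"sigma about line = {sigma_1_2:.4f}, total sigma = {sigma_1:.4f}")
```

The two routes agree to within rounding error, and the scatter about the line is less than half of the total scatter of X1 for these data.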

The average standard deviations around the means of columns
or rows and the standard deviations around the lines of regression
may be called "first-order standard deviations," in contrast to
the total standard deviations, which may be called "zero-order
standard deviations." Sometimes the first-order standard
deviations are called "standard errors of estimate" since they
indicate the error involved in using a column or row mean or a
line of regression as an estimate of the dependent variable.

If the association between X1 and X2, say, is assumed to be
measured by the means of X1 for given values of X2 or by the
line of regression of X1 on X2, then the smallness of the
first-order standard deviations relative to the zero-order standard
deviations will give some measure of the degree of
representativeness of these measures of association. As will be seen
in the next section, this measure of the degree of representativeness
of a line of regression is closely related to the so-called
"Pearsonian coefficient of correlation." As a measure of the degree
of representativeness of a progression of means, it is closely
related to the "correlation ratio," which is discussed in Chap. XV,
Nonlinear Correlation.

The Pearsonian Coefficient of Correlation. The progressions
of means and lines of regression described above were concerned
with describing the "law of relationship" between the two
variables. They gave the average value of one variable associated
with given values of the other variable and showed how
these average values tended to change in unison with the other
variable. In this section, a measure of the degree of association
between the two variables will be described. This measure is
known as the Pearsonian coefficient of correlation after the man
who devised it.

^ The proof of this is as follows:

$$ \sum d_{1.2}^2 = \sum d_{1.2}(X_1 - a_{1.2} - b_{12} X_2) = \sum d_{1.2} X_1 - a_{1.2} \sum d_{1.2} - b_{12} \sum d_{1.2} X_2 $$

But by the least-squares equations (1) and (2), Σd1.2 = 0 and Σd1.2X2 = 0.
Hence Σd1.2² = Σd1.2X1 = ΣX1² - a1.2ΣX1 - b12ΣX1X2.


Fig. 109. A bivariate scatter diagram showing the joint variation in imports
into and exports from the United States. [United States Department of Commerce,
Monthly Summary of Foreign and Domestic Commerce of the United States, Vol. 20
(March, 1940), p. 37; Survey of Current Business, Vol. 21 (March, 1941), p. 37;
Vol. 22 (March, 1942), pp. 5-19.]


The coefficient of correlation suggested by Karl Pearson in
1890 is

$$ r_{12} = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2} \tag{13} $$

In this equation, x1 and x2 refer to deviations from the mean
and N to the number of pairs of cases. For the sake of simplicity,
this coefficient will now be explained by reference to a bivariate
distribution in which the cases are not grouped into class intervals.

Arithmetic View of r. Table 28 and Fig. 109 show the joint
variation of two variables. They indicate that the large values
of X1, representing total exports from the United States,
1932-1941, are associated, for the most part, with the large values of
X2, which represent total imports for consumption into the
United States during the same period of years.

The average of X1, designated X̄1, is found to be 2.89; and the
average of X2, designated X̄2, is 2.19. The deviations of each

Table 28. Exports and Imports of Merchandise, United States, 1932-1941
(In billions of dollars)

            Total      Total      Deviations from       Product deviations
            exports    imports    respective means            x1x2
    Year    X1         X2         x1        x2           +          -
            (1)        (2)        (3)       (4)              (5)
    1932    1.6        1.3        -1.29     -0.89       1.1481
    1933    1.7        1.5        -1.19     -0.69       0.8211
    1934    2.1        1.6        -0.79     -0.59       0.4661
    1935    2.3        2.0        -0.59     -0.19       0.1121
    1936    2.5        2.4        -0.39      0.21                  -0.0819
    1937    3.3        3.0         0.41      0.81       0.3321
    1938    3.1        1.9         0.21     -0.29                  -0.0609
    1939    3.2        2.3         0.31      0.11       0.0341
    1940    4.0        2.6         1.11      0.41       0.4551
    1941    5.1        3.3         2.21      1.11       2.4531
    Σ =     28.9       21.9        0         0          5.8218     -0.1428

    X̄1 = 2.89    X̄2 = 2.19    Σx1x2 = 5.6790


variable from its respective mean are calculated and entered in
the third and fourth columns of the table. The products of x1
and x2, the product deviations, are calculated, and the results
entered in the appropriate division of the last column. The sum
of column (5), that is, Σx1x2 (the sum of the product deviations),
is 5.679.
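The arithmetic of Table 28 is easily reproduced. The following Python sketch, with illustrative variable names that are not part of the text, recomputes the deviations from the means and the positive, negative, and net sums of the product deviations.

```python
# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
years = list(range(1932, 1942))
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]

n = len(X1)
mean1 = sum(X1) / n
mean2 = sum(X2) / n

# Columns (3) and (4): deviations from the respective means.
x1 = [v - mean1 for v in X1]
x2 = [v - mean2 for v in X2]

# Column (5): product deviations, split by sign as in the table.
products = [p * q for p, q in zip(x1, x2)]
positive_sum = sum(p for p in products if p > 0)
negative_sum = sum(p for p in products if p < 0)
net_sum = positive_sum + negative_sum

for year, p, q, pq in zip(years, x1, x2, products):
    print(f"{year}  {p:6.2f}  {q:6.2f}  {pq:8.4f}")
print(f"positive = {positive_sum:.4f}, negative = {negative_sum:.4f}, "
      f"net = {net_sum:.4f}")
```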

In Fig. 109, an X1 and an X2 scale are set up in such a way
as to accommodate the range of these variables as shown in
columns (1) and (2) of Table 28. Lines perpendicular to the
respective scales at the points X1 = 2.89 and X2 = 2.19 are
drawn so that the figure is divided into four quadrants, quadrant
I containing values of X1 and X2 that are both higher than
average (hence both x1 and x2 are positive); quadrant II
containing values of X2 that are smaller than average and values of X1
that are larger than average (hence x2 is negative and x1 is
positive); quadrant III containing values of X1 and X2 that are both
smaller than average (hence both x1 and x2 are negative); and
quadrant IV containing values of X2 that are larger than average
and values of X1 that are smaller than average (hence x2 is
positive and x1 is negative). The origin of the coordinates x1,
x2 is at the intersection of the perpendicular lines at the X̄1
and X̄2 of the scales. For example, measured from the original
origin, the point P has coordinates X1 = 3.3, X2 = 3.0; but
measured from the intersection of the means the coordinates of
point P are x1 = 0.41, x2 = 0.81 [see columns (1), (2), (3), and
(4) for 1937, Table 28]. It should be noted that only one point
is plotted in the fourth quadrant; this is the 1936 pair of variables
from Table 28. The 1938 pair of variables from Table 28
appears as the sole point in the second quadrant. These two
pairs of variables, 1936 and 1938, are the only ones in the set
that have negative product deviations. The rest of the pairs of
observations appear either in the first or third quadrant because
their product deviations are positive quantities.

If the fluctuations of two variables are so associated that their
plottings appear predominantly in quadrants I and III, the
Σx1x2 will be positive. This will be so when larger than average
values of X1 are associated with larger than average values of
X2 (quadrant I) and smaller than average values of X1 are
associated with smaller than average values of X2 (quadrant III).
Also, if the two variables are so associated that their plottings
appear predominantly in quadrants II and IV, the sum of the
product deviations will be negative. This will be so when smaller
than average values of X2 are associated with larger than average
values of X1 (quadrant II) and when larger than average values
of X2 are associated with smaller than average values of X1
(quadrant IV). Furthermore, if the plottings are equally
distributed throughout the four quadrants, the sum of the
product deviations will approach zero because of the canceling
of plus and minus product deviations. This will be so when
there is no tendency for association of the variables in any manner,
that is, when smaller than average values of X1 are associated
about as often with larger values of X2 as with smaller values of
X2, etc.
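The assignment of points to quadrants can be sketched in a few lines of Python. The quadrant numbering follows the convention just described (quadrant II holds points with x1 positive and x2 negative, quadrant IV the reverse); the function name quadrant and the other names are illustrative assumptions. Applied to the data of Table 28, the sketch counts the points in each quadrant and shows that the preponderance in quadrants I and III goes with a positive sum of product deviations.

```python
from collections import Counter

# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
years = list(range(1932, 1942))
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]
mean1 = sum(X1) / len(X1)
mean2 = sum(X2) / len(X2)


def quadrant(x1, x2):
    """Quadrant of a pair of deviations, numbered as in the text."""
    if x1 > 0 and x2 > 0:
        return "I"
    if x1 > 0 and x2 < 0:
        return "II"
    if x1 < 0 and x2 < 0:
        return "III"
    if x1 < 0 and x2 > 0:
        return "IV"
    return "on a dividing line"


labels = {year: quadrant(a - mean1, b - mean2)
          for year, a, b in zip(years, X1, X2)}
counts = Counter(labels.values())
product_sum = sum((a - mean1) * (b - mean2) for a, b in zip(X1, X2))

print(dict(counts))
print(f"sum of product deviations = {product_sum:.4f}")
```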

A similar procedure is followed in Table 29 and Fig. 110, in
which X3 is the price of United States government bonds and
X4 is the yield on such bonds. Casual inspection of the data
reveals that, when the price of bonds is high, yield is low, and
vice versa.

Table 29. Prices and Yields on United States Government Bonds, 1932-1941
Averages on bonds outstanding due or callable after 12 years

            Average       Average     Deviations from       Product deviations
            price         yield,      respective means            x3x4
            ($100 par)    per cent
    Year    X3            X4          x3        x4           +          -
            (1)           (2)         (3)       (4)              (5)
    1932     98.8         3.68        -5.62      0.949                  -5.333
    1933    102.3         3.31        -2.12      0.579                  -1.227
    1934    104.6         3.12         0.18      0.389      0.070
    1935    105.5         2.79         1.08      0.059      0.064
    1936    103.7         2.65        -0.72     -0.081      0.583
    1937    101.7         2.68        -2.72     -0.051      0.139
    1938    103.4         2.56        -1.02     -0.171      0.174
    1939    106.0         2.36         1.58     -0.371                  -0.586
    1940    107.2         2.21         2.78     -0.521                  -1.448
    1941    111.0         1.95         6.58     -0.781                  -5.139
    Σ =     1,044.2       27.31        0         0          1.030      -13.733

    X̄3 = 104.42    X̄4 = 2.731    Σx3x4 (net) = -12.703


The sum of the product deviations in Table 29 is a negative
amount, namely, -12.703. Comparison of Figs. 109 and 110
will at once bring out the contrast in the location of pairs of
plotted points. Whereas in Fig. 109 the points are mainly in
quadrants I and III, the points in Fig. 110 appear principally in
quadrants II and IV.

Again, the same procedure is followed in Table 30 and Fig. 111,
in which X5 is the height of Princeton freshmen and X6 is the
grade of these freshmen in their examination in economics.

In Table 30 the negative and positive product deviations so
nearly offset each other that the sum of product deviations is
only 1.33. The tendency for the scatter of points throughout
all four quadrants is depicted in Fig. 111 on page 345.

Fig. 110. A bivariate scatter diagram showing the joint variation in the
price and yield of United States government bonds. [Federal Reserve Bulletin,
December, 1938, p. 1045; July, 1940, pp. 701-702, and Survey of Current Business,
Vol. 21 (March, 1941), p. 36; Vol. 22 (March, 1942), p. 18.]

These three arithmetic illustrations appear to show that
the sum of the product deviations from the arithmetic means
of variables, Σx1x2, can be used to measure the extent to which
the variables are associated or related. Following are the reasons
for this:

1. When smaller than average values of X1 are associated
with smaller than average values of X2, the x1x2 products, being
products of -x1 and -x2, are positive, as shown in Tables 28 to 30.

2. When larger than average values of X1 are associated with
larger than average values of X2, the x1x2 products, being
products of +x1 and +x2, are also positive, as shown in Tables 28
to 30.

3. On the other hand, when smaller than average values of
X1 are associated with larger than average values of X2, the
x1x2 products, being products of -x1 and +x2, are negative, as
shown for 1932 and 1933 in Table 29.

4. When larger than average values of X1 are associated with
smaller than average values of X2, the x1x2 products, being
products of +x1 and -x2, are also negative, as shown for 1939,
1940, and 1941 in Table 29.


Table 30. Heights of Freshmen, Princeton Class of 1941, and
Grades on Examination in Economics

    Heights of    Grades of       Deviations from       Product deviations
    freshmen,     freshmen,       respective means            x5x6
    in.           percentage
                  of 100
    X5            X6              x5        x6           +          -
    (1)           (2)             (3)       (4)              (5)
    66.00         70              -3.96      +3.8                   15.048
    69.00         67              -0.96      +0.8                    0.768
    70.50         66              +0.54      -0.2                    0.108
    69.50         85              -0.46     +18.8                    8.648
    68.00         55              -1.96     -11.2      21.952
    70.50         60              +0.54      -6.2                    3.348
    70.50         67              +0.54      +0.8       0.432
    71.60         81              +1.64     +14.8      24.272
    73.25         66              +3.29      -0.2                    0.658
    70.75         45              +0.79     -21.2                   16.748
    Σ = 699.60    662                                  +46.656     -45.326

    X̄5 = 69.96    X̄6 = 66.2    Σx5x6 (net) = +1.330


5. When no consistent association prevails between the pairs
of variables observed, the +x1x2 products will balance or very
nearly balance the -x1x2 products, as shown in Table 30.
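The near balance described in the last item can be checked on the figures of Table 30. The Python sketch below, with illustrative variable names, recomputes the product deviations of heights and grades and shows that the positive and negative sums almost cancel.

```python
# Heights (X5, inches) and economics grades (X6, per cent) of ten
# Princeton freshmen, as given in Table 30.
heights = [66.00, 69.00, 70.50, 69.50, 68.00, 70.50, 70.50, 71.60, 73.25, 70.75]
grades = [70, 67, 66, 85, 55, 60, 67, 81, 66, 45]

n = len(heights)
mean_h = sum(heights) / n
mean_g = sum(grades) / n

# Product deviations x5 * x6, split by sign as in column (5) of the table.
products = [(h - mean_h) * (g - mean_g) for h, g in zip(heights, grades)]
positive_sum = sum(p for p in products if p > 0)
negative_sum = sum(p for p in products if p < 0)
net_sum = positive_sum + negative_sum

print(f"means: {mean_h:.2f} in., {mean_g:.1f} per cent")
print(f"positive = {positive_sum:.3f}, negative = {negative_sum:.3f}, "
      f"net = {net_sum:.3f}")
```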

The sum of the products of the deviations from the means 
indicates correspondence or lack of correspondence of variations 
in two sets of variables; but the simple sum of products cannot 
be taken as the measure of correlation between the two variables 
for the following reasons: 

1. The sum of product deviations for one set of paired
variables is not comparable with a similar sum of product deviations
for another set of paired variables. A small sum of product
deviations may result from the fact that a small number of cases
is included, and a large sum of product deviations may indicate
merely that a large number of cases is involved; and yet the actual
degree of correlation might be the same in the two sets. In the
second instance the larger sum of product deviations is due solely
to the fact that it resulted from a larger number of cases. It
seems obvious that an average of the product deviations is
required. Such an average can be obtained by dividing the
sum of product deviations by N. Thus the average product
deviation is Σx1x2/N.

Fig. 111. A bivariate scatter diagram showing the joint variation (or lack
of it) between the heights of Princeton freshmen and their grades on an
examination in economics.

2. The product deviations in terms of original units of the
data are without meaning because of nonhomogeneity of units.
Suppose the X1 variable is the price of wheat per bushel, which
would be expressed in dollars and cents; and the X2 variable is
the birth rate. Or, again, suppose the X1 variable is the height
of men expressed in inches and the X2 variable is the weight of
men expressed in pounds. Or suppose the X1 variable is the
marriage rate and the X2 variable is the volume of trade, or
prices, etc. In all such pairs of variables, the product deviations
in terms of original units are meaningless; they are products of
nonhomogeneous things. What meaning can be ascribed to the
product of inches and pounds or to the product of marriage rates
and volume of trade? It is necessary to perceive in the situation
a general common denominator.

The comparable thing being compared is the purely abstract
thing, deviation above or below average; accordingly, the
standard deviation σ may be used as a general common denominator.
Whatever the original unit of measurement, if the variable is
normally distributed the standard deviation represents approximately
one-sixth the range of that variable. The standard deviation
is a unit of deviation from the mean measuring a common
characteristic among all variables and is, therefore, a
homogeneous unit among all variables. Consequently, the standard
deviation is used to reduce these product deviations to terms of
comparability with each other. When this is done, the average
product deviation becomes a measure of correlation known as the
Pearsonian coefficient, namely,

$$ r_{12} = \frac{\sum \frac{x_1}{\sigma_1} \cdot \frac{x_2}{\sigma_2}}{N} $$

Since σ1 and σ2 are constants in each particular problem, this
equation may be written as follows:

$$ r_{12} = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2} $$

This is the usual form in which the formula for the Pearsonian
coefficient of correlation is given. The value of this average
expression fluctuates between the limits +1 and -1. Any
value greater than +1 or less than -1 signals a mistake in
computation, not an error in the statistical sense. If r12 = +1,
this means perfect positive correlation (large values of X1 are
associated with large values of X2, and vice versa); if r12 = -1,
this means perfect negative correlation (large values of X1 are
associated with small values of X2, and vice versa); if r12 = 0,
this means no linear correlation.
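The effect of dividing by the standard deviations can be seen by changing the unit of one variable. In the Python sketch below, an illustration on the export and import figures of Table 28, exports are restated in millions of dollars while imports are left in billions; the function names product_sum, sigma, and pearson_r are illustrative assumptions. The sum of product deviations changes with the unit, whereas the coefficient r does not.

```python
import math

# Exports (X1) and imports (X2) in billions of dollars, 1932-1941 (Table 28).
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]


def product_sum(u, v):
    """Sum of the products of deviations from the respective means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def sigma(u):
    """Standard deviation with divisor N, as used in the text."""
    return math.sqrt(product_sum(u, u) / len(u))


def pearson_r(u, v):
    """Average product deviation in standard-deviation units."""
    return product_sum(u, v) / (len(u) * sigma(u) * sigma(v))


X1_millions = [1000 * x for x in X1]   # same exports, different unit

sum_billions = product_sum(X1, X2)
sum_mixed = product_sum(X1_millions, X2)
r_billions = pearson_r(X1, X2)
r_mixed = pearson_r(X1_millions, X2)

print(f"sum of product deviations: {sum_billions:.4f} versus {sum_mixed:.1f}")
print(f"r: {r_billions:.4f} versus {r_mixed:.4f}")
```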

Calculation of the Coefficient of Correlation. The data in
Table 28 may be taken to illustrate the detailed calculation
of the coefficient of correlation, by first making all product
deviations in terms of respective standard-deviation units.

Table 31. Calculation of Coefficient of Correlation between
United States Exports and Imports, 1932-1941

    Deviations from      Squares of deviations    Deviations from       Product deviations in
    respective means,    from respective means    respective means in   standard-deviation
    billions of dollars                           standard-deviation    units
                                                  units
    x1        x2         x1²        x2²           x1/σ1     x2/σ2       (x1/σ1)(x2/σ2)
    (1)       (2)        (3)        (4)           (5)       (6)             (7)
                                                                         +         -
    -1.29     -0.89      1.6641     0.7921        -1.251    -1.435      1.795
    -1.19     -0.69      1.4161     0.4761        -1.154    -1.112      1.283
    -0.79     -0.59      0.6241     0.3481        -0.766    -0.951      0.728
    -0.59     -0.19      0.3481     0.0361        -0.572    -0.306      0.175
    -0.39      0.21      0.1521     0.0441        -0.378     0.338                -0.128
     0.41      0.81      0.1681     0.6561         0.398     1.306      0.520
     0.21     -0.29      0.0441     0.0841         0.204    -0.467                -0.095
     0.31      0.11      0.0961     0.0121         0.301     0.177      0.053
     1.11      0.41      1.2321     0.1681         1.077     0.661      0.712
     2.21      1.11      4.8841     1.2321         2.144     1.789      3.836
    Σ =                  10.6290    3.8490                              9.102     -0.223

    σ1 = 1.031    σ2 = 0.6204    Σ(x1/σ1)(x2/σ2) (net) = 8.879

The standard deviations were calculated from the sums of columns (3) and (4).

The Pearsonian coefficient of correlation may now be quickly
calculated from the sum of product deviations in
standard-deviation units [the foot of column (7) of Table 31]. This sum
divided by N is the coefficient of correlation. In other words,

$$ r_{12} = \frac{\sum \frac{x_1}{\sigma_1} \cdot \frac{x_2}{\sigma_2}}{N} = \frac{8.879}{10} = 0.8879 $$
It is not necessary, however, to divide each deviation by its
standard deviation because the two standard deviations are
constants. Table 28 having been constructed, if the standard
deviations are calculated, as in columns (3) and (4), Table 31,
it is then necessary only to use Eq. (13), as follows:

$$ r_{12} = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2} = \frac{5.6790}{10(1.031)(0.6204)} = \frac{0.5679}{0.6396} = 0.8879 $$

Accordingly columns (5) to (7) of Table 31 need not be computed.
For example, to calculate the coefficients of correlation for the
data in Tables 29 and 30, the standard deviations are calculated
and the respective coefficients of correlation are then obtained
as follows:

Correlation between prices and yields on United States
government bonds, 1932-1941 (σ3 = 3.16, σ4 = 0.51):

$$ r_{34} = \frac{\sum x_3 x_4}{N \sigma_3 \sigma_4} = \frac{-12.703}{10(3.16)(0.51)} = \frac{-1.2703}{1.6116} = -0.7882 $$

Correlation between heights and grades of freshmen
(σ5 = 1.889, σ6 = 10.96):

$$ r_{56} = \frac{\sum x_5 x_6}{N \sigma_5 \sigma_6} = \frac{1.33}{10(1.89)(10.96)} = \frac{0.133}{(1.89)(10.96)} = \frac{0.133}{20.7} = 0.0064 $$
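The two routes to r described above, the average of the product deviations in standard-deviation units and the direct use of Eq. (13), can be compared in a short Python sketch. The function names are illustrative assumptions, and standard deviations are taken with divisor N as in the text. The sketch is applied to the data of Tables 28 and 30.

```python
import math


def deviations(values):
    """Deviations of each value from the arithmetic mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def sigma(values):
    """Standard deviation with divisor N."""
    x = deviations(values)
    return math.sqrt(sum(d * d for d in x) / len(x))


def r_from_products(u, v):
    """Eq. (13): sum of product deviations divided by N * sigma_u * sigma_v."""
    xu, xv = deviations(u), deviations(v)
    return sum(a * b for a, b in zip(xu, xv)) / (len(u) * sigma(u) * sigma(v))


def r_from_standard_units(u, v):
    """Average product of deviations expressed in standard-deviation units."""
    su, sv = sigma(u), sigma(v)
    xu, xv = deviations(u), deviations(v)
    return sum((a / su) * (b / sv) for a, b in zip(xu, xv)) / len(u)


# Table 28: exports and imports, billions of dollars.
exports = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
imports = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]
# Table 30: heights (inches) and economics grades (per cent).
heights = [66.00, 69.00, 70.50, 69.50, 68.00, 70.50, 70.50, 71.60, 73.25, 70.75]
grades = [70, 67, 66, 85, 55, 60, 67, 81, 66, 45]

r12 = r_from_products(exports, imports)
r12_alt = r_from_standard_units(exports, imports)
r56 = r_from_products(heights, grades)

print(f"r12 = {r12:.4f} (Eq. 13), {r12_alt:.4f} (standard units)")
print(f"r56 = {r56:.4f}")
```

The two routes give the same value for exports and imports, and the coefficient for heights and grades is close to zero, in agreement with the figures obtained above.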

For a small number of cases it is possible to calculate a
coefficient of correlation according to the procedure illustrated in the
tables and calculations immediately preceding. For a large
number of pairs of values it is desirable to group the pairs into
class intervals. The value of X1 for each pair then becomes the
mid-point of the interval to which the X1 value belongs; the
value of X2 for the pair will be the mid-point of the interval
to which the X2 value belongs. If more than one pair of cases
belongs to the same X1 and X2 intervals, the frequency of such
pairs is determined. This procedure was illustrated by the
analysis of Tables 24 and 25 in discussing the bivariate frequency
distribution of 81 Mount Holyoke freshmen.* When the bivariates
are arranged in a bivariate frequency distribution, r12 is
measured by ΣFx1x2/(Nσ1σ2), where F represents the frequency of
pairs of values belonging to the same X1 and X2 intervals. For
example, in Table 25 (for X1 = 160-, X2 = 120-), F = 5,
x1 = -47.4, and x2 = -74.1. Accordingly, this Fx1x2 (for X1 = 160-,
X2 = 120-) is equal to 5(-47.4)(-74.1) = 17,561.7. When this
procedure is followed for the entire table, the ΣFx1x2 is obtained.
Special methods for calculating r from grouped data are described
in detail in Chap. XIV, in which advantage is taken of certain
short-cut procedures.

Relationship between Lines of Regression and r. If a line 
of regression is fitted by the method of least squares, the values 
of $b_{12}$ and $b_{21}$ are given by Eqs. 6 and 10. It will now be shown 
that these reduce to formulas involving $r_{12}$. From the definition 
$r_{12} = \sum x_1x_2 / N\sigma_1\sigma_2$, 

$$\sum x_1x_2 = N\sigma_1\sigma_2r_{12}$$

Secondly, note that, from the definition $\sigma_2^2 = \sum x_2^2 / N$, 

$$\sum x_2^2 = N\sigma_2^2$$

Hence, 

$$b_{12} = \frac{\sum x_1x_2}{\sum x_2^2} = \frac{N\sigma_1\sigma_2r_{12}}{N\sigma_2^2} = r_{12}\frac{\sigma_1}{\sigma_2}$$

In the same manner it can be shown that 

$$b_{21} = r_{12}\frac{\sigma_2}{\sigma_1}$$

Hence, if deviations of the variables are measured from their 
mean values, the lines of regression may be written (in this case 
the $a_{1.2}$ and $a_{2.1}$ are both zero) 

$$x_1' = r_{12}\frac{\sigma_1}{\sigma_2}x_2 \tag{14}$$

$$x_2' = r_{12}\frac{\sigma_2}{\sigma_1}x_1 \tag{15}$$

If the first of these equations is divided by $\sigma_1$ and the second by 
$\sigma_2$, they become 

$$\frac{x_1'}{\sigma_1} = r_{12}\frac{x_2}{\sigma_2} \qquad \frac{x_2'}{\sigma_2} = r_{12}\frac{x_1}{\sigma_1}$$

* See pp. 325-326. 


Thus it may be concluded that if the variables are measured in 
standard-deviation units, the slopes of the lines of regression are 
the Pearsonian coefficient of correlation. In this light, $r_{12}$ is 
the change in the average value of $X_1$ expressed in $\sigma_1$ units when 
$X_2$ changes by one $\sigma_2$ unit. It is also the change in the average 
value of $X_2$ expressed in $\sigma_2$ units when $X_1$ changes by one $\sigma_1$ unit. 

Fig. 112.—Diagram showing relationship between lines of regression and the 
Pearsonian coefficient of correlation r. 

This property of r is shown geometrically in Fig. 112. The 
slope of the regression of $X_1$ on $X_2$ is r with reference to the 
$X_2$-axis and 1/r with reference to the $X_1$-axis; that is to say, 
the line of regression of $X_1$ on $X_2$ makes with the $X_2$-axis an 
angle whose tangent is r and with the $X_1$-axis an angle whose 
tangent is 1/r. The slope of the regression of $X_2$ on $X_1$ is 
likewise r, but with reference to the $X_1$-axis, and 1/r with 
reference to the $X_2$-axis; that is to say, the line of regression 
of $X_2$ on $X_1$ makes with the $X_1$-axis an angle whose tangent is r 
and with the $X_2$-axis an angle whose tangent is 1/r. In other 
words, in Fig. 112, angle a equals angle a', and angle b equals 
angle b'. All this is on the assumption that the variables are 
expressed in standard-deviation units as indicated in the 
equations above. 

Thus, in Fig. 112, the tangent of the angle a is r, and that of 
the angle b is 1/r. When $|a| \le \pi/4$, $|r| = |\tan a| \le 1$. 
Geometrically, within the limits $|a| \le \pi/4$, tan a varies between +1 
and -1, passing through zero, and tan b between +1 and -1, 
passing through infinity. The two lines of regression merge into 
one line when r = 1 (for tan a = 1 when the angle is a 45-degree 
angle). 

Relationship between r and the First-order Standard Deviation. 
It will be recalled that the standard deviation of the 
deviations from the line of regression of $X_1$ on $X_2$ is given by 

$$N\sigma_{1.2}^2 = \sum X_1^2 - a_{1.2}\sum X_1 - b_{12}\sum X_1X_2$$

If the variables are measured from their mean values, this 
becomes 

$$N\sigma_{1.2}^2 = \sum x_1^2 - b_{12}\sum x_1x_2$$

But 

$$\sum x_1^2 = N\sigma_1^2, \qquad b_{12} = r_{12}\frac{\sigma_1}{\sigma_2}, \qquad \text{and} \qquad \sum x_1x_2 = N\sigma_1\sigma_2r_{12}$$

Hence, 

$$N\sigma_{1.2}^2 = N\sigma_1^2 - N\sigma_1^2r_{12}^2$$

and 

$$\sigma_{1.2}^2 = \sigma_1^2(1 - r_{12}^2)$$

Finally, 

$$\sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2} \tag{16}$$

In the same manner, 

$$\sigma_{2.1} = \sigma_2\sqrt{1 - r_{12}^2} \tag{17}$$

These formulas may also be put in the form 

$$r_{12}^2 = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2} \tag{18}$$

$$r_{12}^2 = 1 - \frac{\sigma_{2.1}^2}{\sigma_2^2} \tag{19}$$

It will thus be seen that r is closely related to the scatter 
about the lines of regression. If this scatter is a small percentage 
of the total scatter, indicating a high degree of representativeness 
of a line of regression, then r is high. If the scatter is a large 
percentage of the total variation in the dependent variable, 
indicating a low degree of representativeness of a line of regression, 
then the value of r is small. In other words, the better 
a line of regression fits the data, the higher the value of r, and 
vice versa. The Pearsonian coefficient of correlation is thus a 
measure of the goodness of fit of the lines of regression. 

The Pearsonian Coefficient of Correlation and the Analysis of 
Variance. For every point on a bivariate scatter diagram such 
as Fig. 107, there is a corresponding point on the line of regression 
of $X_1$ on $X_2$. Geometrically this is obtained by projecting the 
point vertically onto the line of regression (see Fig. 107). 
Algebraically, the $X_1$ coordinate of a point on the line of regression is 
found by substituting the given value of $X_2$ in the regression 
equation 

$$x_1' = r_{12}\frac{\sigma_1}{\sigma_2}x_2$$

When the variables are measured from their mean values, the 
mean of the various values of $x_2$ is zero. Hence the mean of 
the corresponding values of $x_1'$ is zero also. The standard 
deviation of these $x_1'$ values is thus given by 

$$\sigma_{x_1'}^2 = \frac{\sum x_1'^2}{N} = r_{12}^2\frac{\sigma_1^2}{\sigma_2^2}\cdot\frac{\sum x_2^2}{N} = r_{12}^2\sigma_1^2$$

Equation (16) may thus be written 

$$\sigma_{1.2}^2 = \sigma_1^2 - \sigma_{x_1'}^2$$

or 

$$\sigma_1^2 = \sigma_{x_1'}^2 + \sigma_{1.2}^2 \tag{20}$$

This says that the total variance of the $X_1$ values is equal to the 
variance of the corresponding points on the line of regression 
plus the variance of the deviations from these points. Another 
way of looking at this is to regard the total variation in $X_1$ as 
made up of two parts, one consisting of the variation ($\sigma_{x_1'}^2$) due 
to its association with $X_2$ as represented by the line of regression, 
the other representing the variation in $X_1$ due to its association 
with factors independent of $X_2$ (that is, $\sigma_{1.2}^2$). 
Similar analysis shows that 

$$\sigma_2^2 = \sigma_{x_2'}^2 + \sigma_{2.1}^2 \tag{21}$$

in other words, that the total variance in $X_2$ is made up of a part 
($\sigma_{x_2'}^2$) due to its association with $X_1$ as represented by the line of 
regression of $X_2$ on $X_1$ and a part ($\sigma_{2.1}^2$) due to its association 
with factors independent of $X_1$ as measured by the deviations 
from this line of regression. 

The formula $\sigma_{x_1'}^2 = r_{12}^2\sigma_1^2$, which may also be written 
$r_{12}^2 = \sigma_{x_1'}^2 / \sigma_1^2$, sheds further light on the meaning of r. It shows 
that $r_{12}^2$ measures the proportion of the total variance in $X_1$ that is due 
to its association with $X_2$. It also measures the proportion 
of the total variance in $X_2$ that is due to its association with $X_1$. 
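
The decomposition of the total variance into a part on the line of regression and a part about it can be checked numerically. The following is a minimal sketch in Python (a language not used in the original text); the function name `variance_decomposition` is hypothetical, and the ten pairs are the export and import figures reproduced in Table 32, with three $X_2$ entries taken as the square roots of the corresponding squared entries in that table.

```python
import math

# Exports (X1) and imports (X2), as reproduced in Table 32 (assumed readings).
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]


def variance_decomposition(x1, x2):
    """Return (r, total variance of x1, variance on the line, variance about the line)."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    d1 = [a - mean1 for a in x1]            # deviations from the means
    d2 = [b - mean2 for b in x2]
    var1 = sum(a * a for a in d1) / n
    var2 = sum(b * b for b in d2) / n
    cov = sum(a * b for a, b in zip(d1, d2)) / n
    r = cov / math.sqrt(var1 * var2)
    b12 = cov / var2                        # least-squares slope of x1 on x2
    fitted = [b12 * b for b in d2]          # points on the line of regression
    on_line = sum(f * f for f in fitted) / n
    about_line = sum((a - f) ** 2 for a, f in zip(d1, fitted)) / n
    return r, var1, on_line, about_line


r, total, on_line, about_line = variance_decomposition(X1, X2)
print(round(r * r, 4), round(on_line / total, 4), round(about_line / total, 4))
```

The first two printed figures agree, which is the statement that the square of r is the proportion of the total variance associated with the other variable; the third is the remaining proportion.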



CHAPTER XIV 


COMPUTATION OF r AND OTHER MEASURES 
OF CORRELATION 

The previous chapter was concerned with an explanation 
of the various devices used to measure the association between 
two variables. This chapter will illustrate their use by carrying 
out a numerical analysis. Only linear correlation will be con¬ 
sidered here. Measures of nonlinear correlation will be discussed 
in Chap. XV. 

The order of analysis will be first to calculate the correlation 
coefficient. This will be done for both ungrouped and grouped 
data, and use will be made of short-cut methods of calculation. 
For the grouped data, lines of regression will be computed, and 
first-order variances and standard deviations. Reference will 
again be made to the progressions of means, but the analysis will 
be continued no further than in the previous chapter. 

Computation of r from Ungrouped Data. Since $x = X - \bar{X}$, 
it follows that 

$$\sum x_1x_2 = \sum (X_1 - \bar{X}_1)(X_2 - \bar{X}_2) = \sum X_1X_2 - N\bar{X}_1\bar{X}_2$$

Likewise, $\sigma_1$ and $\sigma_2$ are equal to 

$$\sqrt{\frac{\sum X_1^2}{N} - \bar{X}_1^2} \qquad \text{and} \qquad \sqrt{\frac{\sum X_2^2}{N} - \bar{X}_2^2}$$

Hence the correlation coefficient can be computed from the 
equation 

$$r_{12} = \frac{\sum X_1X_2 - N\bar{X}_1\bar{X}_2}{\sqrt{\sum X_1^2 - N\bar{X}_1^2}\,\sqrt{\sum X_2^2 - N\bar{X}_2^2}} \tag{1}$$

To illustrate the use of this formula consider again the data 
on exports and imports of Table 28.¹ These are reproduced 
in Table 32, together with the calculations of $\sum X_1X_2$, $\bar{X}_1$, $\bar{X}_2$, 
$\sum X_1^2$, and $\sum X_2^2$. Three check columns are also employed. The 
checks are 

column (1) + column (2) = column (6); 
column (3) + column (4) = column (7); 
and column (3) + column (5) = column (8). 

In the preliminary calculations of Table 32, the last check 
failed. This showed that a mistake had been made in either 
column (5) or column (8), for column (3) checked with columns 
(4) and (7). After some investigation the mistake was found in 
column (8). By dividing the checks up in this way, an error 
can be easily located. This sort of check is called a “Charlier 
check.” 

¹ See p. 340. 
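
The work-sheet procedure, including the three Charlier checks, can be sketched in a few lines of Python (a language not used in the original text). The sketch is illustrative only: the function name `work_sheet_r` is hypothetical, and the ten pairs are those of Table 32, with three $X_2$ entries taken as the square roots of the corresponding squared entries (4.00, 9.00, and 6.76).

```python
import math

# Exports (X1) and imports (X2) as listed in Table 32 (assumed readings).
X1 = [1.6, 1.7, 2.1, 2.3, 2.5, 3.3, 3.1, 3.2, 4.0, 5.1]
X2 = [1.3, 1.5, 1.6, 2.0, 2.4, 3.0, 1.9, 2.3, 2.6, 3.3]


def work_sheet_r(x1, x2):
    """Compute r from raw sums after verifying the three Charlier checks."""
    n = len(x1)
    col1, col2 = sum(x1), sum(x2)                        # columns (1) and (2)
    col3 = sum(a * b for a, b in zip(x1, x2))            # column (3): X1*X2
    col4 = sum(a * a for a in x1)                        # column (4): X1 squared
    col5 = sum(b * b for b in x2)                        # column (5): X2 squared
    col6 = sum(a + b for a, b in zip(x1, x2))            # column (6): X1 + X2
    col7 = sum(a * (a + b) for a, b in zip(x1, x2))      # column (7)
    col8 = sum(b * (a + b) for a, b in zip(x1, x2))      # column (8)

    # Charlier checks: each pairs two columns against a check column.
    assert math.isclose(col1 + col2, col6)
    assert math.isclose(col3 + col4, col7)
    assert math.isclose(col3 + col5, col8)

    mean1, mean2 = col1 / n, col2 / n
    numerator = col3 - n * mean1 * mean2
    denominator = (math.sqrt(col4 - n * mean1 ** 2)
                   * math.sqrt(col5 - n * mean2 ** 2))
    return numerator / denominator


r12 = work_sheet_r(X1, X2)
print(round(r12, 4))
```

Because each check involves column (3) and one other column, a failed check narrows the search for an error to two columns, which is the point of dividing the checks up.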


Table 32.—Work Sheet for Computing r from Ungrouped Data

    (1)      (2)      (3)      (4)      (5)       (6)          (7)            (8)
    X1       X2      X1X2      X1²      X2²     X1 + X2   X1(X1 + X2)   X2(X1 + X2)

    1.6      1.3      2.08     2.56     1.69      2.9         4.64           3.77
    1.7      1.5      2.55     2.89     2.25      3.2         5.44           4.80
    2.1      1.6      3.36     4.41     2.56      3.7         7.77           5.92
    2.3      2.0      4.60     5.29     4.00      4.3         9.89           8.60
    2.5      2.4      6.00     6.25     5.76      4.9        12.25          11.76
    3.3      3.0      9.90    10.89     9.00      6.3        20.79          18.90
    3.1      1.9      5.89     9.61     3.61      5.0        15.50           9.50
    3.2      2.3      7.36    10.24     5.29      5.5        17.60          12.65
    4.0      2.6     10.40    16.00     6.76      6.6        26.40          17.16
    5.1      3.3     16.83    26.01    10.89      8.4        42.84          27.72

Σ = 28.9     21.9     68.97    94.15    51.81     50.8       163.12         120.78

Means: X̄1 = 2.89, X̄2 = 2.19

Checks: 

Σ(1) + Σ(2) = Σ(6):   28.9 + 21.9 = 50.8 
Σ(3) + Σ(4) = Σ(7):   68.97 + 94.15 = 163.12 
Σ(3) + Σ(5) = Σ(8):   68.97 + 51.81 = 120.78 


From Table 32, r is found according to Eq. (1) to be equal to 

$$r_{12} = \frac{68.97 - 10(2.89)(2.19)}{\sqrt{94.15 - 10 \times 2.89^2}\,\sqrt{51.81 - 10 \times 2.19^2}}$$

$$= \frac{68.97 - 63.291}{\sqrt{94.15 - 83.521}\,\sqrt{51.81 - 47.961}} = \frac{5.679}{\sqrt{10.629}\,\sqrt{3.849}}$$

$$= \frac{5.679}{(3.26)(1.962)} = \frac{5.679}{6.396} = 0.8879$$

where $\sigma_1 = \sqrt{1.0629}$ and $\sigma_2 = \sqrt{0.3849}$. 

This agrees to two decimal places with the previous calculations 
of this coefficient made in Chap. XIII. The difference is 
due to the different ways of rounding off decimals in making the 
calculations. 

Table 33.—Grades in Second- and First-semester English, 81 Freshmen 
at Mount Holyoke 

(A, B, C, and D grades have been converted to a numerical scale) 

Student   Second-semester   First-semester     Student   Second-semester   First-semester
number       grade X1          grade X2        number       grade X1          grade X2

   1           240               220              41           260               260
   2           200               180              42           180               160
   3           260               240              43           100                60
   4           260               260              44           200               220
   5           160               160              45           200               200
   6           240               220              46           160               120
   7           220               200              47           180               160
   8            60               120              48           280               220
   9           220               240              49           200               200
  10           200               180              50           220               220
  11           220               220              51           220               200
  12           140               180              52           240               220
  13           160               120              53           100                60
  14           240               260              54           220               220
  15           260               240              55           240               200
  16           200               160              56           200               220
  17           200               160              57           220               220
  18           240               240              58           220               200
  19           240               220              59           240               200
  20           240               220              60           180               140
  21           160               140              61           160               140
  22           220               240              62           240               220
  23           200               200              63           260               260
  24           100               100              64           160               120
  25           160               140              65           260               240
  26           200               160              66           220               180
  27           180               180              67           220               240
  28           180               160              68           260               260
  29           240               240              69           240               220
  30           200               200              70           200               200
  31           200               200              71           140               120
  32           180               160              72           260               240
  33           260               220              73           200               180
  34           160               120              74           300               280
  35           240               240              75           180               140
  36           220               220              76           220               180
  37           220               240              77           180               180
  38           160               120              78           300               280
  39           200               200              79           220               220
  40           220               220              80           220               200
                                                  81           200               180

Computation of r from Grouped Data. The Data. The data 
to be used to illustrate the computation of r for grouped data 
are given in Table 33. They may be explained as follows: 

First pair Xi, X 2 . The first pair of observations are the 
second-semester and the first-semester English grades, respec¬ 
tively, of student 1, viz., 240,220. 

Second pair Xi, X 2 . The second pair of observations are the 
second-semester and the first-semester English grades, respec¬ 
tively, of student 2, viz., 200,180. 

Third pair $X_1$, $X_2$. The third pair of observations are the 
second-semester and the first-semester English grades, 
respectively, of student 3, viz., 260,240. 

The Correlation, or Bivariate Frequency, Table. After the data 
have been tabulated as in Table 33, a correlation table, which 
is in effect a bivariate frequency distribution, is constructed. 
The table is set up with class-interval scales suitable for each 
variable,¹ and additional columns and rows are arranged for the 
required calculations. In the center of each cell of the correla¬ 
tion table, frequencies are shown; for example, in the first 
column opposite the Xi scale of Table 34, 2 is the frequency of 
occurrence of Xi between 100 and 120 and X 2 between 60 and 80. 
Two students, in other words, have grades in second-semester 
English between 100 and 120 and grades in first-semester English 
between 60 and 80. When all the frequencies are recorded in 
the correlation table, it may be used as a work sheet for the 
calculation of the coefficient of correlation. 

Short Method for Calculating r. Like the standard deviation 
and the mean, it is possible to find r by a short method making 
use of arbitrary origins. 

In the formula for r, 

$$r_{12} = \frac{\sum Fx_1x_2}{N\sigma_1\sigma_2} \tag{2}$$

$\sigma_1$ and $\sigma_2$ may be calculated by the short method that has already 
been presented.² It remains only to evaluate $\sum Fx_1x_2$ in terms 
of deviations from the arbitrary origins, $A_1$ and $A_2$, selected 
for the respective variables and in terms of the two correction 
factors $C_1$ and $C_2$. To do this, note that³ 

$$x_1 = d_1 - C_1 \qquad \text{where} \qquad C_1 = \frac{\sum Fd_1}{N}$$

and 

$$x_2 = d_2 - C_2 \qquad \text{where} \qquad C_2 = \frac{\sum Fd_2}{N}$$

Therefore, 

$$\sum Fx_1x_2 = \sum F(d_1 - C_1)(d_2 - C_2) \tag{3}$$

which expanded is as follows: 

$$\sum Fx_1x_2 = \sum Fd_1d_2 - C_1\sum Fd_2 - C_2\sum Fd_1 + NC_1C_2 \tag{4}$$

But $\sum Fd_1 = NC_1$ and $\sum Fd_2 = NC_2$, and hence 

$$\sum Fx_1x_2 = \sum Fd_1d_2 - NC_1C_2 \tag{5}$$

and accordingly the formula for calculating r by the use of an 
arbitrary origin for $X_1$ and an arbitrary origin for $X_2$ is 

$$r_{12} = \frac{\sum Fd_1d_2 - NC_1C_2}{N\sigma_1\sigma_2} \tag{6}$$

Further saving in calculation results, however, if this formula 
is put in terms of class-interval units. In other words, the following 
form is more conveniently used:* 

$$r_{12} = \frac{\sum F\dfrac{d_1}{i_1}\dfrac{d_2}{i_2} - N\dfrac{C_1}{i_1}\dfrac{C_2}{i_2}}{N\dfrac{\sigma_1}{i_1}\dfrac{\sigma_2}{i_2}} \tag{7}$$

¹ On the question of proper selection of class intervals, see pp. 199-206. 

² See pp. 214-215. 

³ Where $C_1 = \bar{X}_1 - A_1$, $\bar{X}_1 = A_1 + C_1$. By definition $x_1 = X_1 - \bar{X}_1$ and 
$d_1 = X_1 - A_1$, so that $x_1 = d_1 + A_1 - A_1 - C_1 = d_1 - C_1$. 

* The value of the numerator alone is $\sum Fx_1x_2$; if the problem is one in 
multiple correlation, it will be convenient to have a record of this value as 
well as the value of $r_{12}$. 

Table 34.—Correlation Table 

Showing the relationship between second-semester ($X_1$) and first-semester ($X_2$) grades of 81 Mount Holyoke freshmen 

[Table body not reproduced; the class-interval scale runs 60-, 80-, 100-, ..., 340-.] 

The correlation table serves as a work sheet for the calculation 
of the coefficient of correlation, as follows: 

1. An arbitrary origin is chosen for each variable; thus, in 
Table 34, $A_1$ = 190 and $A_2$ = 190. The arbitrary origins are 
taken at the mid-points of a class interval about midway in the 
range of the distribution in order to reduce to a minimum the 
necessary computations. 

2. A column at the side and a row at the bottom of the correlation 
table are used to indicate, in class-interval units, the 
deviations of each variable from the respective arbitrary origins. 
This supplies entries for the rows under the caption $d_1/i_1$ and 
entries in the columns opposite the stub headings $d_2/i_2$. In Table 
34, $i_1 = i_2 = 20$. 

3. The next column at the side and row at the bottom of Table 
34 are for the purpose of entering the frequencies multiplied by 
the class-interval deviations. The sums of this column and row, 
respectively, are used in the calculation of the correction figures 
$C_1/i_1$ and $C_2/i_2$ and in the computation of the means of the 
separate frequency distributions. The sums of the columns give 
the separate frequency distribution of $X_2$, and the sums of the 
rows give the separate frequency distribution of $X_1$. 

4. The next column and the next row are for the frequencies 
multiplied by the class-interval deviations squared, in order to 
obtain sums from which to calculate the standard deviations of 
the respective variables. 

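5. The means and standard deviations of the two variables 
are calculated as follows:¹ 

Calculation of the means: 

$$\bar{X}_1 = 190 + \tfrac{111}{81}(20) = 190 + 27.40740 = 217.4$$

$$\bar{X}_2 = 190 + \tfrac{57}{81}(20) = 190 + 14.074 = 204.1$$

Calculation of the standard deviations:² 

$$\left(\frac{\sigma_1}{i_1}\right)^2 = \frac{543}{81} - \left(\frac{111}{81}\right)^2 = 6.70370 - 1.87791 = 4.82579$$

$$\frac{\sigma_1}{i_1} = 2.1968 \qquad \sigma_1 = 43.94$$

$$\left(\frac{\sigma_2}{i_2}\right)^2 = \frac{493}{81} - \left(\frac{57}{81}\right)^2 = 6.08642 - 0.49519 = 5.59123$$

$$\frac{\sigma_2}{i_2} = 2.3646 \qquad \sigma_2 = 47.29$$

¹ Using Eq. (3), p. 213. 

² Using Eq. (5), p. 215. 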

6. The product of the deviations from the chosen arbitrary 
origins is obtained for each cell in the correlation table. This is 
obtained by multiplying the $d_1/i_1$ by the $d_2/i_2$ corresponding to 
the position of that cell. For example, for $X_1$ = 100- and $X_2$ = 
60-, the cell in the first column and third row of Table 34, there 
is a frequency of 2. According to the chosen arbitrary origins, 
this cell has a product deviation (in terms of class-interval units) 
of -6 multiplied by -4, or +24. Symbolically, this is 
$(d_1/i_1)(d_2/i_2)$. The table is divided into four quadrants by lines 
through the $A_1$ and $A_2$. 

Two of these quadrants will have positive product deviations, 
and two will have negative product deviations. A product 
deviation is entered in each cell that contains frequencies and 
appears in the lower right corner of the cell. None are entered 
in the first quadrant because no frequencies occur in that quadrant. 
Frequencies occur in only one cell in the third quadrant, 
that is, in the $X_1$ = 200-, $X_2$ = 160- cell, for which the deviation 
product is -1 multiplied by +1, or -1. 

7. The product deviation in each cell is multiplied by the 
frequency occurring in that cell, in order to obtain the proper 
number of product deviations of that particular cell. The 
product deviation occurs once in some cases and several times in 
others. When it occurs several times, the sum of the 
product deviations is obtained by multiplying by the frequencies. 
These figures are entered in each cell in the upper right corner. 
Symbolically, they are $F(d_1/i_1)(d_2/i_2)$, for each cell. 

8. The sum of the figures calculated in item 7 is obtained, that 
is, the sum of the product deviations multiplied by their respective 
frequencies. This is accomplished by adding the figures 
occurring in the upper right corner of each cell by rows and by 
columns and adding the sums of the rows or the sums of the 
columns to obtain the final sum. If both the latter are computed, 
there will be a cross check on addition. Symbolically, the 
final aggregate is $\sum F(d_1/i_1)(d_2/i_2)$. 

9. The coefficient of correlation is calculated by the use of 
Eq. (7) shown above, as follows: 

Calculation of r: 

$$r_{12} = \frac{455 - 81\left(\frac{111}{81}\right)\left(\frac{57}{81}\right)}{81(2.19677)(2.36458)} = \frac{455 - 78.11111}{420.74964} = \frac{376.88889}{420.74964} = +0.89576$$
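
Items 1 to 9 can be sketched in Python (a language not used in the original text). This is an illustration under stated assumptions rather than the authors' work sheet: the function name `short_method` is hypothetical, and each grade in Table 33 is treated as the lower limit of its class interval (so that 240 falls in the interval 240-, whose mid-point is 250), a reading that reproduces the sums quoted in the text.

```python
import math
from collections import Counter

# Table 33, students 1 to 81 in order: second-semester (X1), first-semester (X2).
X1 = [240, 200, 260, 260, 160, 240, 220, 60, 220, 200, 220, 140, 160, 240,
      260, 200, 200, 240, 240, 240, 160, 220, 200, 100, 160, 200, 180, 180,
      240, 200, 200, 180, 260, 160, 240, 220, 220, 160, 200, 220, 260, 180,
      100, 200, 200, 160, 180, 280, 200, 220, 220, 240, 100, 220, 240, 200,
      220, 220, 240, 180, 160, 240, 260, 160, 260, 220, 220, 260, 240, 200,
      140, 260, 200, 300, 180, 220, 180, 300, 220, 220, 200]
X2 = [220, 180, 240, 260, 160, 220, 200, 120, 240, 180, 220, 180, 120, 260,
      240, 160, 160, 240, 220, 220, 140, 240, 200, 100, 140, 160, 180, 160,
      240, 200, 200, 160, 220, 120, 240, 220, 240, 120, 200, 220, 260, 160,
      60, 220, 200, 120, 160, 220, 200, 220, 200, 220, 60, 220, 200, 220,
      220, 200, 200, 140, 140, 220, 260, 120, 240, 180, 240, 260, 220, 200,
      120, 240, 180, 280, 140, 180, 180, 280, 220, 200, 180]


def short_method(x1, x2, a1=190, a2=190, i1=20, i2=20):
    """Compute r from a correlation table using arbitrary origins a1 and a2."""
    cells = Counter(zip(x1, x2))              # F for every occupied cell
    n = sum(cells.values())
    s1 = s2 = s11 = s22 = s12 = 0.0
    for (g1, g2), f in cells.items():
        d1 = (g1 + i1 / 2 - a1) / i1          # deviation in class-interval units
        d2 = (g2 + i2 / 2 - a2) / i2
        s1 += f * d1
        s2 += f * d2
        s11 += f * d1 * d1
        s22 += f * d2 * d2
        s12 += f * d1 * d2                    # frequency times product deviation
    c1, c2 = s1 / n, s2 / n                   # correction figures C/i
    sd1 = math.sqrt(s11 / n - c1 ** 2)        # sigma/i for each variable
    sd2 = math.sqrt(s22 / n - c2 ** 2)
    r = (s12 - n * c1 * c2) / (n * sd1 * sd2)
    return {"N": n, "sum_d1": s1, "sum_d2": s2, "sum_d1_sq": s11,
            "sum_d2_sq": s22, "sum_d1d2": s12,
            "mean1": a1 + c1 * i1, "mean2": a2 + c2 * i2, "r": r}


result = short_method(X1, X2)
print(result["sum_d1d2"], round(result["r"], 5))
```

Working in class-interval units keeps every intermediate figure a small whole number until the final division, which is the saving the short method is meant to provide.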


Lines of Regression and First-order Variances. All the values 
that are needed to find the lines of regression of Table 34 have 
now been calculated. There are two lines of regression for each 
correlation table: the first represents the regression of $X_1$ 
on $X_2$ and the second the regression of $X_2$ on $X_1$. Since r has 
been computed, the easiest formulas for calculating these two 
lines (in original units measured from the intersection of the 
means of the two variables as an origin and not in class-interval 
units) are as follows: 

$$x_1' = r\frac{\sigma_1}{\sigma_2}x_2 \qquad x_2' = r\frac{\sigma_2}{\sigma_1}x_1$$

These equations can be expressed in the units of the original 
data, that is, the scale as originally formed rather than in deviations 
from the means, as follows: 

$$X_1' - \bar{X}_1 = r\frac{\sigma_1}{\sigma_2}(X_2 - \bar{X}_2) \qquad X_2' - \bar{X}_2 = r\frac{\sigma_2}{\sigma_1}(X_1 - \bar{X}_1)$$

Calculation. For the problem illustrated, the lines of regression 
are calculated as 

$$x_1' = 0.89576\,\frac{43.9354}{47.2915}\,x_2 = 0.8322x_2 \qquad x_2' = 0.89576\,\frac{47.2915}{43.9354}\,x_1 = 0.9642x_1$$

By substituting $X_1' - \bar{X}_1$ for $x_1'$ and $X_2' - \bar{X}_2$ for $x_2'$, these 
equations may be written as follows: 

$$X_1' - 217.4 = 0.8322(X_2 - 204.1)$$

$$X_1' = 0.832X_2 + 47.58$$

$$X_2' - 204.1 = 0.9642(X_1 - 217.4)$$

$$X_2' = 0.964X_1 - 5.55$$

In this form the equations are more easily interpreted as 
prediction equations. The first equation says that when a 
student has a grade of 100 in first-semester English the predicted 
grade in second-semester English is 83.2 + 47.6 = 130.8. The 
second equation says that when a student has a grade of 100 
in second-semester English the predicted grade in first-semester 
English is 96.4 - 5.55 = 90.85. 
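
The two prediction equations can be reproduced from r, the two standard deviations, and the two means. The following Python sketch is illustrative only (Python is not used in the original text, and the names `predict_second_from_first` and `predict_first_from_second` are hypothetical); the means are carried to more places than the rounded 217.4 and 204.1, which is why the constants come out as 47.58 and -5.55.

```python
# Values computed from Table 34 in the text.
r = 0.89576
sigma1, sigma2 = 43.9354, 47.2915          # second- and first-semester grades
mean1 = 190 + 27.40740                     # mean of X1
mean2 = 190 + 14.074                       # mean of X2

b12 = r * sigma1 / sigma2                  # slope of the regression of X1 on X2
b21 = r * sigma2 / sigma1                  # slope of the regression of X2 on X1
a12 = mean1 - b12 * mean2                  # intercepts in original units
a21 = mean2 - b21 * mean1


def predict_second_from_first(x2):
    """Predicted second-semester grade for a given first-semester grade."""
    return a12 + b12 * x2


def predict_first_from_second(x1):
    """Predicted first-semester grade for a given second-semester grade."""
    return a21 + b21 * x1


print(round(b12, 4), round(a12, 2), round(b21, 4), round(a21, 2))
print(round(predict_second_from_first(100), 1))
```

Note that the two slopes are not reciprocals of one another; their product is the square of r, so the two lines coincide only when the correlation is perfect.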

The two lines of regression are shown in Figs. 105 and 106.¹ 
In Fig. 105 line aa' represents the first line of regression, 

$$X_1' = 0.832X_2 + 47.58$$

The small crosses show the location of the means of the columns 
(calculated and shown in Table 34). It is to be noted that the 
line of regression follows the progression of the means of the 
columns. 

In Fig. 106, line bb' represents the second line of regression, 
$X_2' = 0.964X_1 - 5.55$. The small circles show the location of 
the means of the rows (calculated and shown in Table 34). 
It is to be noted that the line of regression follows the progression 
of the means of the rows. 

The scatter about each of the lines of regression, the first-order 
$\sigma$, is calculated by using the following formulas:* 

$$\sigma_{1.2} = \sigma_1\sqrt{1 - r_{12}^2} \qquad \sigma_{2.1} = \sigma_2\sqrt{1 - r_{12}^2}$$

In the problem illustrated, the first-order standard deviations are 
calculated as follows: 

$$\sigma_{1.2} = 43.94(0.44453) = 19.53 \qquad \sigma_{2.1} = 47.29(0.44453) = 21.02$$

(When r = 0.89576, $\sqrt{1 - r^2}$ = 0.44453.) 

In Figs. 105 and 106, which show the lines of regression, there 
are also shown the limits indicated by the first-order standard 
deviations. Between these limits, that is, the line of regression 
$\pm\sigma_{1.2}$ for Fig. 105 and the line of regression $\pm\sigma_{2.1}$ for Fig. 106, 
lie roughly two thirds of the frequencies, if it can be assumed 
that the population from which the sample is derived is normally 
distributed. This gives some idea of how accurate estimates 
based upon the lines of regression are likely to be. It is to be 
noted that all the means of the columns lie within the limits 
described by the first-order standard deviations and that all but 
two of the means of the rows lie within these limits. Each of 
the latter two means of rows, lying outside these limits (the first 
row and the next to the last row), is based upon only one student's 
record. 

Progressions of Means. For these data the progressions of 
means have already been discussed in Chap. XIII. Figures 
105 and 106 show the means of the columns and the means of the 
rows plotted from the values computed in Table 34 and reproduced 
in Table 26.² Figure 105 represents the means of the 
vertical frequency distributions of Tables 25 and 34; it gives the 
progression of the means of $X_1$ with changing values of $X_2$. Figure 
106 gives a similar analysis for the means of the rows. 

¹ See pp. 328, 329. 

* Calculation of $\sqrt{1 - r^2}$ is avoided by the use of J. R. Miner, 
Tables of $\sqrt{1 - r^2}$ and $1 - r^2$ for Use in Partial Correlation and in 
Trigonometry, or an ordinary table of sines and cosines, since 
$\sin x = \sqrt{1 - \cos^2 x}$. 

² See pp. 330 and 358. 



CHAPTER XV 
NONLINEAR CORRELATION 


All the foregoing discussion has been concerned with those 
cases in which the progression of the means is linear. In such 
cases it was found that $r = \sum x_1x_2 / N\sigma_1\sigma_2$ was an appropriate 
measure of correlation. If the progression of means and the 
distribution of cases around the means is as pictured in Fig. 113, 
however, r may show little correlation, especially in such cases 
as A and C, although there may be a high degree of association 
between the variables. It is the purpose of this chapter to 
indicate ways of describing and measuring such nonlinear 
correlation. 

As indicated in an earlier chapter, the best way of studying 
any correlation is to make a bivariate scatter diagram of the 
data. If the data are numerous enough to be grouped into class 
intervals, then the means of the rows and columns may be 
computed and the variation in the means of each variable with 
changes in the other variable may be studied. 

In the linear case in which a line of regression was used to 
measure the association it was found that, the smaller the 
scatter, the higher the degree of correlation, the equation being 

$$r_{12}^2 = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2}$$

The same sort of formula may be used to measure the degree 
of relationship indicated by the progression in the means. To 
distinguish them from the correlation coefficient these measures 
are called “correlation ratios” and are defined by the formulas 

$$\eta_{12}^2 = 1 - \frac{\sigma_{X_1-\bar{X}_c}^2}{\sigma_1^2} \qquad \eta_{21}^2 = 1 - \frac{\sigma_{X_2-\bar{X}_r}^2}{\sigma_2^2} \tag{1}$$

where $\bar{X}_c$ represents the means of $X_1$ for various values of $X_2$, 
$\bar{X}_r$ represents the means of $X_2$ for various values of $X_1$, and 
$\sigma_{X_1-\bar{X}_c}^2$ and $\sigma_{X_2-\bar{X}_r}^2$ represent the sum of the squared deviations 
around the means pooled for all the column or row means and 
divided by N. Thus, 

$$\sigma_{X_1-\bar{X}_c}^2 = \frac{\sum (X_1 - \bar{X}_c)^2}{N} \qquad \sigma_{X_2-\bar{X}_r}^2 = \frac{\sum (X_2 - \bar{X}_r)^2}{N}$$

Fig. 113.—Illustrations of various kinds of nonlinear correlation. 

The correlation ratios $\eta_{12}$ and $\eta_{21}$ give some indication of the 
degree to which the means of one variable are successful in 
measuring the variation in the other variable. They may be 
used to measure either linear or nonlinear correlation. 

If the means of one variable seem to mark off a definite curve 
or if in the case of ungrouped data a bivariate chart indicates a 
fairly definite form of nonlinear variation, then the average 
variation in one variable with changes in the other variable may 
be indicated by drawing a smooth curve or fitting one by some 
mathematical process, such as the method of least squares. 
Such a curve might be called a “curve of regression.” A line 
of regression on the one hand indicates the average change 
in one variable with a unit change of the other variable; this 
average change is the same for all values of the independent 
variable, since the slope of a straight line is constant. A curve 
of regression on the other hand gives the average change in one 
variable with a unit change in the other variable; but this average 
change varies from one value of the independent variable to 
another, since the slope of a curve changes at each point. The 
technique of fitting a curve of regression will be discussed in a 
subsequent section. 

To measure the degree with which a curve of regression 
measures the association between two variables, an index of 
correlation is defined in a manner similar to the definitions of 
r and $\eta$. It depends on the closeness with which the various 
cases are scattered about the curve and is defined by the formula 

$$I_{12}^2 = 1 - \frac{\sigma_{X_1-C_{12}}^2}{\sigma_1^2} \qquad I_{21}^2 = 1 - \frac{\sigma_{X_2-C_{21}}^2}{\sigma_2^2}$$

where $C_{12}$ and $C_{21}$ refer to the regression curves, $\sigma_{X_1-C_{12}}^2$ refers 
to the variance of the deviations from the curve of regression of 
$X_1$ on $X_2$, and $\sigma_{X_2-C_{21}}^2$ refers to the variance of the deviations 
from the curve of regression of $X_2$ on $X_1$. 

Although $r_{12} = r_{21}$, the two correlation ratios and the two 
indexes of correlation are not necessarily equal. That is, 
$\eta_{12} \ne \eta_{21}$ and $I_{12} \ne I_{21}$. In addition, $\eta \ge I \ge r$. 

Since the variance about the means or about a curve is never 
greater than the total variance, these formulas always give a 
positive value and their square roots are indeterminate as to 
sign. The square roots of $\eta^2$ and $I^2$ give an index of correlation, 
and the question as to whether it is a positive or negative 
relationship must be answered by reference to a correlation table 
or a figure showing the polygon or curve of regression. In 
the case of curvilinear correlation, the question of positive or 
negative relationship often is irrelevant because two variables 
may be positively correlated up to a certain point and then 
negatively correlated beyond that point. Consequently, it 
becomes necessary to describe the entire relationship. For 
example, the death rate due to puerperal septicemia is correlated 
with ages of the female population in a nonlinear manner. The 
relationship between the two is best described by a curve or 
polygon of regression, which would have to be seen in its entirety 
if the relationship is to be completely understood. This is 
illustrated in Fig. 55 (page 151). If r merely were calculated, it 
might conceivably be zero when there is in fact a close relationship. 
An index of such a relationship is found by the calculation 
of the $\eta$'s or the I's. 

Calculation of the Correlation Ratio. The calculation of the 
correlation ratio will be illustrated by reference to the Mount 
Holyoke data in Table 34 (page 358). Although the relationship 
appears to be linear, it is worth while to compute the correlation 
ratio to see how close it comes to r. If the difference is not very 
great, the linearity will be numerically demonstrated. ** 

Equation (1) for the correlation ratios may be put in the form 

$$\eta_{12}^2 = \frac{\sigma_1^2 - \sigma_{X_1-\bar{X}_c}^2}{\sigma_1^2} \qquad \eta_{21}^2 = \frac{\sigma_2^2 - \sigma_{X_2-\bar{X}_r}^2}{\sigma_2^2} \tag{3}$$

where $\sigma_{X_1-\bar{X}_c}$ and $\sigma_{X_2-\bar{X}_r}$ represent the average standard 
deviations around the means of the columns and the means of the rows, 
respectively, as explained above. In order to apply these equations 
for finding the correlation ratios it is necessary to find the values of 
$\sigma_{X_1-\bar{X}_c}^2$ and $\sigma_{X_2-\bar{X}_r}^2$. This can be most conveniently done with 
the help of a work sheet that makes use of arbitrary origins ($A_1$ and $A_2$) 
and class-interval deviations ($d_1/i_1$ and $d_2/i_2$). Such a work sheet is 
Table 35, in which the computations are carried out for the data 
of Table 34. The algebraic foundation for these computations 
is as follows: 

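It is assumed that the same $A_1$ is used for every column as 
for the total frequency distribution of $X_1$. Then for each 
column the sum of the squares of the deviations from the column 
mean would be¹ 

$$\sum_{1}^{N_c} F(X_1 - \bar{X}_c)^2 = i_1^2\left[\sum_{1}^{N_c} F\left(\frac{d_1}{i_1}\right)^2 - \frac{1}{N_c}\left(\sum_{1}^{N_c} F\frac{d_1}{i_1}\right)^2\right]$$

For the sum of all columns this would be 

$$\sum_{1}^{m}\sum_{1}^{N_c} F(X_1 - \bar{X}_c)^2 = i_1^2\left[\sum_{1}^{m}\sum_{1}^{N_c} F\left(\frac{d_1}{i_1}\right)^2 - \sum_{1}^{m}\frac{1}{N_c}\left(\sum_{1}^{N_c} F\frac{d_1}{i_1}\right)^2\right]$$

But 

$$\sum_{1}^{m}\sum_{1}^{N_c} F\left(\frac{d_1}{i_1}\right)^2 = \sum_{1}^{N} F\left(\frac{d_1}{i_1}\right)^2$$

and by definition 

$$\sum_{1}^{m}\sum_{1}^{N_c} F(X_1 - \bar{X}_c)^2 = N\sigma_{X_1-\bar{X}_c}^2$$

Therefore, 

$$\sigma_{X_1-\bar{X}_c}^2 = \frac{i_1^2}{N}\left[\sum_{1}^{N} F\left(\frac{d_1}{i_1}\right)^2 - \sum_{1}^{m}\frac{1}{N_c}\left(\sum_{1}^{N_c} F\frac{d_1}{i_1}\right)^2\right] \tag{4}$$

It has been determined already that² 

$$\sigma_1^2 = \frac{i_1^2}{N}\left[\sum_{1}^{N} F\left(\frac{d_1}{i_1}\right)^2 - \frac{1}{N}\left(\sum_{1}^{N} F\frac{d_1}{i_1}\right)^2\right] \tag{5}$$

Each of the variances in Eq. (3) may be expressed in class-interval 
units so that its numerator is the arithmetic difference 
between Eqs. (5) and (4) and its denominator is Eq. (5). Thus, 

$$\eta_{12}^2 = \frac{\displaystyle\sum_{1}^{m}\frac{1}{N_c}\left(\sum_{1}^{N_c} F\frac{d_1}{i_1}\right)^2 - \frac{1}{N}\left(\sum_{1}^{N} F\frac{d_1}{i_1}\right)^2}{\displaystyle\sum_{1}^{N} F\left(\frac{d_1}{i_1}\right)^2 - \frac{1}{N}\left(\sum_{1}^{N} F\frac{d_1}{i_1}\right)^2} \tag{6}$$

¹ This follows from Eqs. (1) and (2) of Chap. VII. 

² See Eqs. (1) and (2) of Chap. VII and previous footnote. 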

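Similarly it can be shown that for a table with rows 

$$\eta_{21}^2 = \frac{\displaystyle\sum\frac{1}{N_r}\left(\sum_{1}^{N_r} F\frac{d_2}{i_2}\right)^2 - \frac{1}{N}\left(\sum_{1}^{N} F\frac{d_2}{i_2}\right)^2}{\displaystyle\sum_{1}^{N} F\left(\frac{d_2}{i_2}\right)^2 - \frac{1}{N}\left(\sum_{1}^{N} F\frac{d_2}{i_2}\right)^2} \tag{7}$$

where the first sum in the numerator is taken over all rows. 

All the items in these two formulas (6) and (7) are to be found 
on the work sheet in Table 34 with the exception of 

$$\sum\frac{1}{N_c}\left(\sum_{1}^{N_c} F\frac{d_1}{i_1}\right)^2 \qquad \text{and} \qquad \sum\frac{1}{N_r}\left(\sum_{1}^{N_r} F\frac{d_2}{i_2}\right)^2$$

These two figures are obtained from the correlation-ratio work 
sheet (Table 35). 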
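
The correlation ratio in class-interval units can be sketched in Python (a language not used in the original text). This is an illustration under stated assumptions, not the authors' work sheet: the function name `correlation_ratio_squared` is hypothetical, the grades are those of Table 33, and each grade is treated as the lower limit of its class interval, so that the deviation from the arbitrary origin 190 is (grade + 10 - 190)/20.

```python
import math
from collections import defaultdict

# Table 33, students 1 to 81 in order: second-semester (X1), first-semester (X2).
X1 = [240, 200, 260, 260, 160, 240, 220, 60, 220, 200, 220, 140, 160, 240,
      260, 200, 200, 240, 240, 240, 160, 220, 200, 100, 160, 200, 180, 180,
      240, 200, 200, 180, 260, 160, 240, 220, 220, 160, 200, 220, 260, 180,
      100, 200, 200, 160, 180, 280, 200, 220, 220, 240, 100, 220, 240, 200,
      220, 220, 240, 180, 160, 240, 260, 160, 260, 220, 220, 260, 240, 200,
      140, 260, 200, 300, 180, 220, 180, 300, 220, 220, 200]
X2 = [220, 180, 240, 260, 160, 220, 200, 120, 240, 180, 220, 180, 120, 260,
      240, 160, 160, 240, 220, 220, 140, 240, 200, 100, 140, 160, 180, 160,
      240, 200, 200, 160, 220, 120, 240, 220, 240, 120, 200, 220, 260, 160,
      60, 220, 200, 120, 160, 220, 200, 220, 200, 220, 60, 220, 200, 220,
      220, 200, 200, 140, 140, 220, 260, 120, 240, 180, 240, 260, 220, 200,
      120, 240, 180, 280, 140, 180, 180, 280, 220, 200, 180]


def correlation_ratio_squared(dependent, grouping, origin=190, width=20):
    """Return (eta squared, pooled array figure) for `dependent` on `grouping`."""
    n = len(dependent)
    d = [(x + width / 2 - origin) / width for x in dependent]
    total = sum(d)                            # sum of F * d/i over all cases
    total_sq = sum(v * v for v in d)          # sum of F * (d/i)^2 over all cases
    arrays = defaultdict(list)                # one array per column (or row)
    for v, g in zip(d, grouping):
        arrays[g].append(v)
    # For each array: (sum of F * d/i) squared, divided by the array's N.
    pooled = sum(sum(a) ** 2 / len(a) for a in arrays.values())
    correction = total ** 2 / n
    return (pooled - correction) / (total_sq - correction), pooled


eta12_sq, pooled_columns = correlation_ratio_squared(X1, X2)
eta21_sq, pooled_rows = correlation_ratio_squared(X2, X1)
print(round(pooled_columns, 4), round(eta12_sq, 5), round(math.sqrt(eta12_sq), 4))
```

The same function serves for both ratios because only the roles of the dependent variable and the grouping variable are exchanged; the two results need not be equal.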

In Table 35, the frequency is placed in large type in the center 
of a cell. Each column is now regarded as a separate frequency 
distribution whose total number of cases Nc is shown in the 
row headed Nc. For each colunm the same arbitrary origin 
(Ai = 190) as that used in Table 34 is used; hence the same 
di/ii can be used for each column. 

For all 11 columns in the upper right comer of each interval 
that contains a frequency is a number in small type representing 



Table 35 .—Correlation-ratio Work Sheet 

Calculation of 1712 and 7721 between second-semester (Xi) and first-semester (X 2 ) grades of 81 Mount Holyoke freshmen 
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110.00 110.00| 152.861 178.00 195.00 
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d 

the $F\,d_1/i_1$ for that interval of the column. These are then summed,
giving for each column $\sum_1^{N_c} F\,d_1/i_1$. Each of these sums is shown
in the row with the stub title $\sum_1^{N_c} F\,d_1/i_1$. If each sum is divided
by the number of cases in the column $N_c$ and multiplied by $i_1$,
the resulting number is the correction factor $C_c$ for that column.
Accordingly, the mean for that column ($\bar{X}_c$) can be found by
using the formula $\bar{X}_c = A_1 + C_c$. The results of this calculation
are shown in the row with the stub title $C_c$, and the
column means are shown in the row with the stub heading $\bar{X}_c$.

In order to obtain the figure to be used in the formula for the
correlation ratio (that is, for the square root of $\eta_{12}^2$), another
row of figures is now added to Table 35; this set of figures consists
of the $\left(\sum_1^{N_c} F\,d_1/i_1\right)^2 / N_c$ for each column; and when these
are summed for all columns (say for $m$ columns), the resulting
figure is as follows:

$$\sum_1^m \frac{\left(\sum_1^{N_c} F\,\frac{d_1}{i_1}\right)^2}{N_c} = 473.1215$$


Using Eq. (6), the correlation ratio of $X_1$ on $X_2$ is thus*

$$\eta_{12}^2 = \frac{473.1215 - \dfrac{(111)^2}{81}}{543 - \dfrac{(111)^2}{81}} = \frac{473.1215 - 152.1111}{543.0000 - 152.1111} = \frac{321.0104}{390.8889} = 0.82123$$

$$\eta_{12} = 0.9062$$

* The values $\sum F\,d_1/i_1 = 111$ and $\sum F\,(d_1/i_1)^2 = 543$ are found in Table 34.
In that table the same $A_1$ was used for the frequency distribution of $X_1$.
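
The arithmetic of Eq. (6) can be checked with a short sketch. The function name `correlation_ratio_squared` and its argument names are illustrative assumptions, not part of the text; the four inputs are the work-sheet figures quoted in the calculation of the correlation ratios (473.1215, 111, 543 and $N = 81$ for the columns; 436.2445, 57, 493 for the rows).

```python
import math


def correlation_ratio_squared(sum_group_term, sum_fd, sum_fd2, n):
    """Square of the correlation ratio from work-sheet sums.

    sum_group_term: sum over columns (or rows) of (sum F d/i)^2 / N_group
    sum_fd:         sum of F d/i for the whole distribution
    sum_fd2:        sum of F (d/i)^2 for the whole distribution
    n:              total number of cases
    """
    correction = sum_fd ** 2 / n
    return (sum_group_term - correction) / (sum_fd2 - correction)


# Column figures (X1 on X2) and row figures (X2 on X1) quoted in the text.
eta12_sq = correlation_ratio_squared(473.1215, 111, 543, 81)
eta21_sq = correlation_ratio_squared(436.2445, 57, 493, 81)

print(round(eta12_sq, 5), round(math.sqrt(eta12_sq), 4))
print(round(eta21_sq, 5), round(math.sqrt(eta21_sq), 4))
```

Because both numerator and denominator are in class-interval units, the class interval cancels and need not be supplied.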
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To calculate the means of the rows and the correlation ratio 
of $X_2$ on $X_1$, every row of Table 35 is treated as a separate
frequency distribution. The same $A_2$ is used for each row as
the $A_2$ in Table 34 for the entire $X_2$ distribution. Accordingly,
the same set of $d_2/i_2$ may be used for each row. The $F\,d_2/i_2$ for
each interval of each row is placed in small type in the lower
right corner of the interval. These summed for each row give
the $\sum_1^{N_r} F\,d_2/i_2$ shown in the column with that title heading. From
these are obtained the $C_r$ for each row, by the same procedure
as that used for finding the column means. For each row, the
$\left(\sum_1^{N_r} F\,d_2/i_2\right)^2 / N_r$ is then computed and entered in the column
with that title. The sum of these for all row frequency
distributions (say $l$ rows) constitutes the aggregate

$$\sum_1^l \frac{\left(\sum_1^{N_r} F\,\frac{d_2}{i_2}\right)^2}{N_r} = 436.2445$$

This is the value required by Eq. (7) for finding $\eta_{21}$. Thus,¹

$$\eta_{21}^2 = \frac{436.2445 - \dfrac{(57)^2}{81}}{493 - \dfrac{(57)^2}{81}} = \frac{436.2445 - 40.1111}{493.0000 - 40.1111} = \frac{396.1334}{452.8889} = 0.87467$$

$$\eta_{21} = 0.9352$$


¹ The values $\sum F\,d_2/i_2 = 57$ and $\sum F\,(d_2/i_2)^2 = 493$ are found in Table 34.
In that table the $A_2$ is the same as the $A_2$ used in the present table.

The Correlation Ratio and Analysis of Variance. The square
of the correlation ratio is a measure of the proportion of variance
due to correlation, in the same manner as it was indicated that
the square of the coefficient of correlation is a measure of the
proportion of variance due to correlation.

As has been explained, when expressed in the form rVj = 
the square of the coefficient of correlation reveals itself as the 
proportion of the total variance that is due to correlation or 
association with X 2 as measured by the line of regression of Xi 
on X 2 . In a similar manner, and likewise rihal = 

The square of the correlation ratio thus describes the proportion 
of the total variance that is due to correlation as measured by 
the fluctuations in the means of the columns and rows. The 
standard deviation of the means of the columns squared is the 
variance that is due to correlation of $X_1$ with $X_2$ and similarly
for the correlation of $X_2$ with $X_1$.

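To demonstrate algebraically that $\eta_{12}^2\sigma_1^2$ equals the total variance
$\sigma_1^2$ less the variance about the column means, it
is necessary first to note that by definition the weighted mean
of the means of the columns equals $\bar{X}_1$. By definition,

$$\bar{X}_c = \frac{\sum_1^{N_c} X_1}{N_c}$$

and thus

$$N_c\bar{X}_c = \sum_1^{N_c} X_1$$

which, if summed for all columns, becomes

$$\sum_1^m N_c\bar{X}_c = \sum_1^m \sum_1^{N_c} X_1 \qquad (8)$$

But

$$\sum_1^m \sum_1^{N_c} X_1 = \sum_1^N X_1$$

and hence, if Eq. (8) is divided by $N$, it is equivalent to

$$\frac{\sum_1^m N_c\bar{X}_c}{N} = \frac{\sum X_1}{N} = \bar{X}_1 \qquad (9)$$

which was to be proved.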

If $\bar{X}_1$, the mean of the entire $X_1$ distribution, is now selected
as the arbitrary origin,

$$\bar{X}_c = \bar{X}_1 + C_c \quad\text{or}\quad C_c = \bar{X}_c - \bar{X}_1$$

Also, when the mean of the entire distribution is selected as
the arbitrary origin for each column, the standard deviation
of the column is found by

$$\sigma_c^2 = \frac{\sum_1^{N_c} x_1^2}{N_c} - C_c^2 \qquad [x_1^2 = (X_1 - \bar{X}_1)^2]$$

On substituting $C_c = \bar{X}_c - \bar{X}_1$ and transposing, an expression
for each column similar to the following will result:

$$(a)\qquad \frac{\sum_1^{N_c} x_1^2}{N_c} = \sigma_c^2 + (\bar{X}_c - \bar{X}_1)^2$$

Multiplying the equation for each column by its $N_c$, respectively,
will result for each column in

$$(b)\qquad \sum_1^{N_c} x_1^2 = N_c\sigma_c^2 + N_c(\bar{X}_c - \bar{X}_1)^2$$

When the whole series, one for each column, of equations such as
(b) are totaled, the following result is obtained:

$$(c)\qquad \sum_1^m \sum_1^{N_c} x_1^2 = \sum_1^m N_c\sigma_c^2 + \sum_1^m N_c(\bar{X}_c - \bar{X}_1)^2$$

But, in this equation, 

m Nc N 

= Xxl = Na\ 


iNA = 

1 

Moreover, by definition, the explained variance, that is to say, 
the variance of the means of the columns about the weighted 
mean of these means, is as follows: 

m 

n 4. = - Xiy = X Nc(Xc - Xiy 

1 

Consequently, (c) may be written 

Na\ = Nal, + N4, 
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or 


^2 _ ^2 
^1 


J 2 

+ 0-i 


Substituting the value of (tJ, , = (tJ — (r|^ in Eq. (3) for the 
correlation ratio gives the following: 


= Viiol 


Similarly, it can be shown that 


„2 


( 10 ) 


( 11 ) 
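
The decomposition argued above (total sum of squares equals the sum of squares within columns plus the weighted sum of squares of the column means about the general mean) can be verified numerically. The sketch below uses a small set of made-up columns; the data and all names in it are hypothetical and serve only to illustrate the identity, not to reproduce Table 35.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical X1 values grouped by column (that is, by class of X2).
columns = [[2.0, 4.0, 6.0], [5.0, 7.0], [8.0, 9.0, 10.0, 13.0]]

all_values = [v for col in columns for v in col]
n = len(all_values)
grand_mean = mean(all_values)

# The weighted mean of the column means equals the general mean, as in Eq. (9).
weighted_mean_of_means = sum(len(c) * mean(c) for c in columns) / n

total_ss = n * variance(all_values)
within_ss = sum(len(c) * variance(c) for c in columns)
between_ss = sum(len(c) * (mean(c) - grand_mean) ** 2 for c in columns)

# Square of the correlation ratio: share of variance carried by the column means.
eta_squared = between_ss / total_ss

print(total_ss, within_ss + between_ss, eta_squared)
```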


Calculation of Curvilinear Regression. To illustrate the statistical
problem involved in curvilinear regression and the calculation
of the correlation index $I$, data on cotton stocks, production,
and imports compared with cotton prices, 1920-1939, have been
selected. They are shown in Table 36 and plotted in Fig. 114.

Table 36.—Stocks, Production, and Imports of Cotton and Price
of Cotton Received by Producers in the United States
Stocks at beginning of crop year plus year's production plus net imports.
Prices are deflated by United States index of wholesale prices for crop years.

Year beginning    Deflated average price,    Stocks, production, and imports,
Aug. 1            cents per pound ($X_1$)    ten million bales ($X_2$)

1920-1921         13.47                      1.726
1921-1922         18.06                      1.480
1922-1923         22.63                      1.306
1923-1924         29.30                      1.274
1924-1925         22.63                      1.550
1925-1926         19.19                      1.805
1926-1927         12.92                      2.195
1927-1928         20.95                      1.711
1928-1929         18.71                      1.749
1929-1930         18.34                      1.755
1930-1931         12.13                      1.862
1931-1932          8.38                      2.378
1932-1933         10.30                      2.307
1933-1934         14.04                      2.162
1934-1935         15.76                      1.765
1935-1936         13.83                      1.815
1936-1937         14.48                      1.821
1937-1938         10.29                      2.382
1938-1939         11.17                      2.383

The position of plotted bivariates in Fig. 114 suggests that a 
curve such as aa' might fit the data. The question of the type of 
curve fitted is of particular importance in curvilinear regression 
and accordingly three types will be discussed for illustrative 
purposes. 



Fig. 114.—Bivariate scatter diagram and fitted curve showing relationship 
between the price of cotton and the supply of cotton. 

Logarithmic Regression. The constant slope of a straight
line depicts the fact that the change in $X_1$ is constant for a
given quantity of change in $X_2$, and vice versa. The changing
slope of a curve depicts the fact that change in $X_1$ varies for
different values of $X_2$, and vice versa. One such curvilinear
relationship between $X_1$ and $X_2$ is as follows:

$$X_1 X_2^b = k \qquad (12)$$

In Eq. (12) the varying manner in which $X_1$ fluctuates with
respect to $X_2$ depends on the exponent $b$. If $b$ is larger than 1,
a small change in $X_2$ must produce a large change in $X_1$ because,
as the equation indicates, their product (when $X_2$ is raised to
the $b$ power) is constant. If $b$ is equal to 1, the changes in $X_1$
must be just proportionate (in an inverse manner) to the changes
in $X_2$. If $b$ is less than 1, the changes in $X_1$ must be proportionately
less than the changes in $X_2$. If such an equation is used
to describe the line of regression of price of cotton on stocks
and production of cotton, a very flexible price of cotton will
result in a value of $b$ larger than 1; a very inflexible price of
cotton will result in a value of $b$ less than 1. The nature of
Eq. (12) assumes that the flexibility in the price of cotton remains
the same regardless of stocks and production, because it sets up
the hypothesis that the product equals a constant.

Fig. 115.—The relationship of Fig. 114 in logarithmic form.

If such an equation is assumed to be suitable for the problem
in hand, the fitting of the curve of regression may be simplified
by first transforming the equation to its logarithmic form, namely,

$$\log X_1 + b \log X_2 = \log k = a \quad (\text{if } \log k = a) \qquad (13)$$

Figure 115 shows the effect of transforming the bivariate frequency
distribution from original units to logarithmic units.
The data plotted are the same as the data plotted in Fig. 114,
except that, in Fig. 115, the $X_1$ and $X_2$ scales refer to the logarithms
of $X_1$ and $X_2$. When the bivariate logarithms shown in
the first two columns of Table 37 are plotted in Fig. 115, a straight
line fits the points. Thus the logarithmic transformation has
converted a curvilinear correlation problem into a simple linear
correlation problem in which the Pearsonian coefficient of
correlation is $r_{\log X_1 \log X_2}$ and the line of regression of $\log X_1$ on
$\log X_2$ is as follows:

$$\log X_1 - \text{mean of } \log X_1 = r_{\log X_1 \log X_2}\,\frac{\sigma_{\log X_1}}{\sigma_{\log X_2}}\,(\log X_2 - \text{mean of } \log X_2)$$

Table 37.—Logarithms of United States Production, Stocks, and
Imports of Cotton and of the Price of Cotton Received by Producers
With columns for the squares of the logarithms and their cross products
$X_1$ = price of cotton
$X_2$ = stocks, production, and imports of cotton

      log X_1    log X_2    log X_1 log X_2    log^2 X_1    log^2 X_2

       1.1294     0.2370        0.2677           1.2755       0.0562
       1.2567     0.1703        0.2140           1.5793       0.0290
       1.3547     0.1159        0.1570           1.8352       0.0134
       1.4669     0.1052        0.1543           2.1518       0.0111
       1.3547     0.1903        0.2578           1.8352       0.0362
       1.2831     0.2565        0.3291           1.6464       0.0658
       1.1113     0.3414        0.3794           1.2350       0.1166
       1.3212     0.2333        0.3082           1.7456       0.0544
       1.2721     0.2428        0.3089           1.6182       0.0590
       1.2634     0.2443        0.3086           1.5962       0.0597
       1.0839     0.2700        0.2927           1.1748       0.0729
       0.9232     0.3762        0.3473           0.8523       0.1415
       1.0128     0.3631        0.3677           1.0258       0.1318
       1.1474     0.3349        0.3843           1.3165       0.1122
       1.1976     0.2467        0.2954           1.4343       0.0609
       1.1408     0.2589        0.2954           1.3014       0.0670
       1.1608     0.2603        0.3022           1.3475       0.0678
       1.0124     0.3769        0.3816           1.0250       0.1421
       1.0481     0.3771        0.3952           1.0985       0.1422

Sum = 22.5405     5.0011        5.7468          27.0945       1.4398
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The equations of regression could be obtained in the above form 
and then transformed into their antilogarithmic form; but in
this problem it is more convenient to find the regression equation
directly from the least-squares equations. Accordingly, the
regression statistics $a$ and $b$ of Eq. (13) may be calculated by
using the following least-squares equations:¹

$$\sum \log X_1 = Na + b\sum \log X_2$$
$$\sum \log X_1 \log X_2 = a\sum \log X_2 + b\sum \log^2 X_2$$

In these equations $b$ is the slope of $\log X_1$ on $\log X_2$, so that it
carries the opposite sign from the exponent $b$ of Eq. (12).

Table 37 is a work sheet providing columns to calculate $\sum \log X_1$,
$\sum \log X_2$, $\sum \log X_1 \log X_2$, $\sum \log^2 X_1$, and $\sum \log^2 X_2$, using the
data of the cotton problem for which the raw data are found in
Table 36. The first two columns of Table 37 show the logarithms
of the price of cotton in the United States and of the stocks, production,
and imports of cotton. The third column contains the
cross products of the logarithms. The fourth and fifth columns
contain the squares of the logarithms in the first two columns.
The sums of the columns provide the values that are required to
find the regression statistics $a$ and $b$ for Eq. (13).

Calculation of the regression of $\log X_1$ on $\log X_2$:

22.5405 = 19a + 5.0011b
 5.7468 = 5.0011a + 1.4398b

In order to solve, eliminate $a$ by multiplying the second equation
by 3.7992 and subtracting it from the first, as follows:

22.5405 = 19a + 5.0011b
21.8332 = 19a + 5.4701b
 0.7073 = -0.4690b
      b = -1.5081

Substituting this value of $b$ in either of the equations will show
that

a = 1.5833

Accordingly, the equation of logarithmic regression of $\log X_1$
on $\log X_2$ is as follows:

$$\log X_1 = 1.5833 - 1.5081 \log X_2$$

which may be transformed into antilogarithmic form as follows:

$$X_1 X_2^{1.5081} = 38.31$$

¹ See p. 333.
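
The two least-squares equations can also be solved in closed form from the column sums of Table 37. The following is a minimal sketch under that assumption; the helper `fit_line_from_sums` and its argument names are illustrative and not the authors' work-sheet procedure. Because it carries more decimal places than the hand calculation, its results agree with $a = 1.5833$ and $b = -1.5081$ only to about three decimals.

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_xy, sum_xx):
    """Solve the two normal equations of y = a + b*x:
         sum_y  = n*a     + b*sum_x
         sum_xy = a*sum_x + b*sum_xx
    """
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Column sums of Table 37: x = log X2 (supply), y = log X1 (price).
a, b = fit_line_from_sums(
    n=19,
    sum_x=5.0011,
    sum_y=22.5405,
    sum_xy=5.7468,
    sum_xx=1.4398,
)

# Antilogarithmic form X1 * X2**(-b) = k, with k = 10**a.
k = 10 ** a

print(round(a, 4), round(b, 4), round(k, 2))
```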





Reciprocal Regression. Reciprocal regression is a special
form of the type of regression indicated by Eq. (12); for if $b = 1$,
changes in $X_1$ are related reciprocally to changes in $X_2$. In
other words, the equation becomes

$$X_1 X_2 = k \quad\text{or}\quad \frac{1}{X_1} = \frac{1}{k}X_2 \qquad (14)$$

which, placed in a more general form, is as follows:

$$\frac{1}{X_1} = a + bX_2 \qquad (15)$$

Fig. 116.—The relationship of Fig. 114 in reciprocal form.

If the reciprocal of each $X_1$ is found, it is possible to find the
equation for the reciprocal regression by fitting a straight line
to $X_2$ and the reciprocal of $X_1$, that is to say, by fitting an equation
such as (15). Figure 116 shows the effect of transforming
one of the variables of the bivariate frequency distribution from
original units to reciprocal units. In the figure the vertical
scale is $1/X_1$ while the horizontal scale remains $X_2$. When the
bivariates shown in Table 38 are plotted in Fig. 116, a straight
line fits the points. Thus the reciprocal transformation has
converted a problem in curvilinear correlation into a problem
in simple linear correlation in which the Pearsonian coefficient
of correlation is $r_{\frac{1}{X_1} X_2}$ and the line of regression is as follows:

$$\frac{1}{X_1} - \text{mean of } \frac{1}{X_1} = r_{\frac{1}{X_1} X_2}\,\frac{\sigma_{1/X_1}}{\sigma_2}\,(X_2 - \bar{X}_2)$$

Table 38.—United States Supply of Cotton and the Reciprocal of
the Price of Cotton Received by Producers
With columns for the squares and the cross products
$X_1$ = price of cotton
$X_2$ = supply of cotton

        X_2      1/X_1     X_2/X_1     X_2^2     (1/X_1)^2

       1.726    0.07424    0.12814    2.97908    0.00551
       1.480    0.05537    0.08195    2.19040    0.00307
       1.306    0.04419    0.05771    1.70564    0.00195
       1.274    0.03413    0.04348    1.62308    0.00116
       1.550    0.04419    0.06849    2.40250    0.00195
       1.805    0.05211    0.09406    3.25803    0.00272
       2.195    0.07740    0.16989    4.81803    0.00599
       1.711    0.04773    0.08167    2.92752    0.00228
       1.749    0.05345    0.09348    3.05900    0.00286
       1.755    0.05453    0.09570    3.08003    0.00297
       1.862    0.08244    0.15350    3.46704    0.00680
       2.378    0.11933    0.28377    5.65488    0.01424
       2.307    0.09709    0.22399    5.32225    0.00943
       2.162    0.07123    0.15400    4.67424    0.00507
       1.765    0.06345    0.11199    3.11523    0.00403
       1.815    0.07231    0.13124    3.29423    0.00523
       1.821    0.06906    0.12576    3.31604    0.00477
       2.382    0.09718    0.23148    5.67392    0.00944
       2.383    0.08953    0.21335    5.67869    0.00802

Sum = 35.426    1.29896    2.54365   68.23983    0.09749


The equations of regression could be obtained in the above form
and then transformed into their original units, but in this
problem it is more convenient to find the regression equation
directly from the least-squares equations. The normal least-squares
equations are as follows:

$$\sum \frac{1}{X_1} = Na + b\sum X_2$$
$$\sum \frac{X_2}{X_1} = a\sum X_2 + b\sum X_2^2$$

Table 38 is a work sheet with columns in which the required sums
are obtained. Entering these sums in the above least-squares
equations makes it possible to evaluate the regression statistics
$a$ and $b$ for Eq. (15).

Calculation of the regression of $1/X_1$ on $X_2$:

1.29896 = 19a + 35.4260b
2.54365 = 35.4260a + 68.23983b

Multiplying the first equation by 1.8645263 and subtracting the
result from the second equation eliminates $a$ and gives a solution
for $b$ as follows:

b = 0.05564

Substituting this value in either equation gives the solution of $a$
as follows:

a = -0.03538

The equation of regression is therefore as follows:

$$\frac{1}{X_1} = -0.03538 + 0.05564X_2$$

This equation describes the straight line plotted in Fig. 116.
Plotted on scales of $X_1$ and $X_2$, the equation is a curve.
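
The reciprocal fit can be reproduced directly from the 19 pairs of observations. The sketch below assumes the price and supply figures as they are listed in Tables 36 and 38; the helper names `fit_line` and `estimate_price` are illustrative. Small differences from the hand-computed $a$ and $b$ are to be expected, since the work sheet rounds each reciprocal to five decimals.

```python
# X1 = deflated price (cents per pound), X2 = supply (ten million bales).
prices = [13.47, 18.06, 22.63, 29.30, 22.63, 19.19, 12.92, 20.95, 18.71, 18.34,
          12.13, 8.38, 10.30, 14.04, 15.76, 13.83, 14.48, 10.29, 11.17]
supply = [1.726, 1.480, 1.306, 1.274, 1.550, 1.805, 2.195, 1.711, 1.749, 1.755,
          1.862, 2.378, 2.307, 2.162, 1.765, 1.815, 1.821, 2.382, 2.383]


def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x via the two normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Transform only the dependent variable: regress 1/X1 on X2, as in Eq. (15).
reciprocals = [1.0 / p for p in prices]
a, b = fit_line(supply, reciprocals)


def estimate_price(x2):
    """Convert the fitted reciprocal back into a price."""
    return 1.0 / (a + b * x2)


print(round(a, 5), round(b, 5), round(estimate_price(2.5), 2))
```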

Parabolic Regression. The curvilinear relationships so far
considered have been relationships that could readily be transformed
to a linear form, by taking logarithms or reciprocals.
Such transformations reduced the problem to one of simple
linear correlation between the transformed variables, and there
was little in the analysis that was different from that of the
previous chapters. A curve that cannot easily be transformed
to a linear form is the parabolic relationship

$$X_1 = a + b_1X_2 + b_2X_2^2 \qquad (16)$$
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This must be fitted directly. Fortunately, the nature of 
the curve is such that the method of least squares can be used. 
According to this, to fit a parabolic regression the least-squares 
equations are obtained as follows: 

The least-squares criterion is that

$$\sum d^2 = \sum (X_1 - X_1')^2 = \text{minimum}$$

or

$$\sum (X_1 - a - b_1X_2 - b_2X_2^2)^2 = \text{minimum}$$

For this to be a minimum, its partial derivatives with respect to
$a$, $b_1$, and $b_2$ must each equal zero; differentiating and setting the
results equal to zero, the following normal equations are obtained:

$$\sum X_1 = Na + b_1\sum X_2 + b_2\sum X_2^2$$
$$\sum X_1X_2 = a\sum X_2 + b_1\sum X_2^2 + b_2\sum X_2^3$$
$$\sum X_1X_2^2 = a\sum X_2^2 + b_1\sum X_2^3 + b_2\sum X_2^4$$

Table 39 is a work sheet providing for the calculation and
checking of the sums entering into the three parabolic equations
of regression. Using the sums of the appropriate columns,
the following set of equations is obtained for the calculation of
the regression statistics $a$, $b_1$, and $b_2$ for the regression of $X_1$ on $X_2$,
shown in Eq. (16):

306.58   = 19a       + 35.426b_1   + 68.2398b_2     (I)
542.7359 = 35.426a   + 68.2398b_1  + 135.4744b_2    (II)
994.4092 = 68.2398a  + 135.4744b_1 + 276.3974b_2    (III)

The solution of three equations for three unknowns should be
undertaken in an orderly manner; this is attempted in Table 40,
which is a work sheet following the so-called Doolittle method.
This work sheet provides a step-by-step check on the calculations
as the solution of the equations proceeds. In order to avoid
copying $a$, $b_1$, and $b_2$ each time an equation is written down, $a$, $b_1$,
and $b_2$ are written as the titles of columns in which their coefficients
are entered. In the table only the coefficients are entered
in their respective columns with the proper sign before each figure.
For example, row (1) of the table is presumed to read as follows:

19a + 35.4260b_1 + 68.2398b_2 - 306.58 = 0

which is the first equation above with slight rearrangement of
terms.



Table 39.—United States Supply of Cotton and the Price of Cotton Received by Producers
With columns for the second, third, and fourth powers and for the necessary cross products to fit parabolic regressions

$X_1$ = price of cotton
$X_2$ = supply of cotton

Table 40.—Doolittle Work Sheet for Calculating Three Regression Statistics for Curvilinear Correlation
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* 8.2396 « 135.4744 - 127.23496. 

Row (11) is used to find $b_2$; row (6) is used to find $b_1$ after $b_2$ is found; row (2) is used to find $a$ after $b_2$ and $b_1$ are found, from the equations

Row (11), -1.0000b_2 + 7.9944 = 0
Row (6),  -1.0000b_1 - 3.76733b_2 - 13.2095 = 0
Row (2),  -1.0000a - 1.864526b_1 - 3.59157b_2 + 16.1358 = 0
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Three steps are involved in solving three equations for three 
unknowns: (1) to get an equation in the three unknowns in which
the coefficient of one of the unknowns is unity, (2) to get an
equation in only two of the unknowns in which the coefficient
of one of the two is unity, and (3) to get an equation in only one
of the unknowns in which its coefficient is unity. When the
third step is accomplished, the value of the third unknown is
obtained. This value, applied in the equation obtained by the
second step, makes it possible to evaluate the second unknown;
and the first unknown is then obtained by applying these two
values in the equation obtained by the first step. This is the
same process as that used for finding two unknowns from two
equations.

Table 40 provides an orderly procedure and also a check
for these steps. The first step is accomplished in row (2) of
the table, by multiplying Eq. (I), copied in row (1), by the
negative reciprocal of the coefficient of $a$, that is, by $-\frac{1}{19}$; this
will make the coefficient of $a$ become $-1$. The second step,
rows (3) to (6), eliminates $a$ from two of the equations in order
to obtain in row (5) an equation in $b_1$ and $b_2$. In order to
eliminate $a$, the first equation must be divided by its own coefficient
of $a$ and multiplied by the coefficient of $a$ of Eq. (II); in
other words, Eq. (I) must be multiplied by $-\frac{35.426}{19}$. The
multiplier is given a negative sign so that, when added to Eq. (II),
the $a$ term will cancel. Row (6) multiplies row (5) by the negative
reciprocal of the coefficient of $b_1$ in row (5). The third step, rows
(7) to (11), accomplishes the elimination of two of the variables,
ending with an equation in only one of them, which of course gives
its value. In order to do this, Eq. (III) is copied in row (7);
Eq. (I) is multiplied by a number that will give it a coefficient
of $a$ equal in size and opposite in sign to the coefficient of $a$ of
Eq. (III), that is, by $-\frac{68.2398}{19}$,
and this is entered in row (8); then the equation obtained in
row (6) (in terms of only $b_1$ and $b_2$, $a$ having been eliminated) is
multiplied by a number that, combined with the two coefficients
of $b_1$ in rows (7) and (8), will give a sum of zero. The sum of
rows (7) to (9) will then eliminate both $a$ and $b_1$, giving in row (10)
such an equation. When row (10) is multiplied by the negative
reciprocal of its coefficient of $b_2$, the value of $b_2$ is obtained; this
is shown in row (11).

A column for sums is provided in order to obtain a step-by-step
check on all calculations. This is done by applying to
the sums the same multipliers as those applied to the equations.
For example, the sum of row (1) multiplied by $-\frac{1}{19}$ should equal
the sum of row (2). In the column headed Checks are entered
the products obtained by multiplying the sums as indicated
under Remarks to visualize the checks.

From Table 40, the values of $a$, $b_1$, and $b_2$ are obtained from
equations in rows (2), (6), and (11), as follows:

Row (2),  -a - 1.864526b_1 - 3.59157b_2 + 16.1358 = 0
Row (6),  -b_1 - 3.76733b_2 - 13.2095 = 0
Row (11), -b_2 + 7.9944 = 0

b_2 = 7.9944

b_1 = -3.76733(7.9944) - 13.2095
    = -43.327

a = 43.327(1.864526) - 7.9944(3.59157) + 16.135789
  = 68.2077

The equation of regression of $X_1$ on $X_2$ is therefore as follows:

$$X_1 = 68.2077 - 43.327X_2 + 7.9944X_2^2$$
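
The same three normal equations can be solved by ordinary elimination and back substitution, which follows the three steps described above without the Doolittle work-sheet layout or its check column. This is a simplified sketch; the function `solve_linear_system` is an illustrative helper, not the authors' procedure. Because the equations are nearly dependent, the unrounded solution differs from the hand-computed values in the third or fourth significant figure.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination without pivoting, then back substitution.

    Adequate here because the normal equations are symmetric and
    positive definite; a general solver would also pivot.
    """
    n = len(rhs)
    rows = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for pivot in range(n):
        pivot_value = rows[pivot][pivot]
        # Make the leading coefficient of the pivot row unity.
        rows[pivot] = [v / pivot_value for v in rows[pivot]]
        # Eliminate that unknown from every later equation.
        for r in range(pivot + 1, n):
            factor = rows[r][pivot]
            rows[r] = [v - factor * p for v, p in zip(rows[r], rows[pivot])]
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = rows[i][n] - known
    return solution


# Coefficients and constants of Eqs. (I), (II), and (III).
coefficients = [
    [19.0, 35.426, 68.2398],
    [35.426, 68.2398, 135.4744],
    [68.2398, 135.4744, 276.3974],
]
constants = [306.58, 542.7359, 994.4092]

a, b1, b2 = solve_linear_system(coefficients, constants)
print(round(a, 3), round(b1, 3), round(b2, 4))
```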

Estimates Based on Regression Equations. Using the three
equations of regression calculated above for the regression of
$X_1$ on $X_2$, that is, for the regression of the price of cotton on
production, stocks, and imports of cotton in the United States,
estimates may be made of the price that will result from a given
volume of stocks plus production plus imports. The three
equations are as follows:

Logarithmic regression, $\log X_1' = 1.5833 - 1.5081 \log X_2$
Reciprocal regression, $1/X_1' = -0.03538 + 0.05564X_2$
Parabolic regression, $X_1' = 68.2077 - 43.327X_2 + 7.9944X_2^2$

To illustrate the method of estimation, suppose the questions
are asked: What is the expected price of cotton if the cotton
stocks plus the year's production and imports amount to 25 million
bales? What is the expected price of cotton if the cotton
stocks plus the year's production and imports amount to 22
million bales? 19 million bales? 16 million bales? 13 million
bales? Only 10 million bales? How much higher will the
price be in a year of shortage than in a year of large carry-over
and large production and imports of cotton? Table 41 shows
how these estimates are made by using the above three equations
of regression, with $X_2$ expressed in units of ten million bales, so
that 25 million bales corresponds to $X_2 = 2.5$. When the year's
cotton stocks, production, and imports are large, the three
regression equations give results that are approximately equal to
each other; but when the year's cotton stocks, production, and
imports are small, the estimates based upon the three regression
equations differ sharply from each other.

Table 41.—Estimates of Cotton Prices Based on Three Regression
Curves

Estimates based on logarithmic regression

Values of   log X_2    Equation of estimate                   log X_1    Estimate of
X_2                    1.5833 - 1.5081 log X_2 = log X_1                 X_1

2.5         0.39794    1.5833 - 1.5081(0.39794) =             0.98317     9.62
2.2         0.34242    1.5833 - 1.5081(0.34242) =             1.06690    11.67
1.9         0.27875    1.5833 - 1.5081(0.27875) =             1.16292    14.55
1.6         0.20412    1.5833 - 1.5081(0.20412) =             1.27547    18.86
1.3         0.11394    1.5833 - 1.5081(0.11394) =             1.41147    25.79
1.0         0.00000    1.5833 - 1.5081(0.00000) =             1.58330    38.31

Estimates based on reciprocal regression

Values of   Equation of estimate                 1/X_1      Estimate of
X_2         -0.03538 + 0.05564 X_2 = 1/X_1                  X_1

2.5         -0.03538 + 0.05564(2.5) =            0.10372     9.64
2.2         -0.03538 + 0.05564(2.2) =            0.08703    11.49
1.9         -0.03538 + 0.05564(1.9) =            0.07034    14.22
1.6         -0.03538 + 0.05564(1.6) =            0.05364    18.64
1.3         -0.03538 + 0.05564(1.3) =            0.03695    27.06
1.0         -0.03538 + 0.05564(1.0) =            0.02026    49.36

Estimates based on parabolic regression

Values of   Equation of estimate                            Estimate of
X_2         68.2077 - 43.327 X_2 + 7.9944 X_2^2 = X_1       X_1

2.5         68.2077 - 43.327(2.5) + 7.9944(6.25) =           9.85
2.2         68.2077 - 43.327(2.2) + 7.9944(4.84) =          11.58
1.9         68.2077 - 43.327(1.9) + 7.9944(3.61) =          14.75
1.6         68.2077 - 43.327(1.6) + 7.9944(2.56) =          19.35
1.3         68.2077 - 43.327(1.3) + 7.9944(1.69) =          25.39
1.0         68.2077 - 43.327(1.0) + 7.9944(1.00) =          32.87
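
The three equations of estimate can be evaluated side by side for any supply figure. The sketch below simply codes the three fitted equations with the coefficients given in the text; the function names are illustrative, and $X_2$ is taken in units of ten million bales, as in the work sheets.

```python
import math


def estimate_logarithmic(x2):
    """log X1 = 1.5833 - 1.5081 log X2, converted back to a price."""
    return 10 ** (1.5833 - 1.5081 * math.log10(x2))


def estimate_reciprocal(x2):
    """1/X1 = -0.03538 + 0.05564 X2, converted back to a price."""
    return 1.0 / (-0.03538 + 0.05564 * x2)


def estimate_parabolic(x2):
    """X1 = 68.2077 - 43.327 X2 + 7.9944 X2^2."""
    return 68.2077 - 43.327 * x2 + 7.9944 * x2 ** 2


for x2 in (2.5, 2.2, 1.9, 1.6, 1.3, 1.0):
    print(
        x2,
        round(estimate_logarithmic(x2), 2),
        round(estimate_reciprocal(x2), 2),
        round(estimate_parabolic(x2), 2),
    )
```

Printing the three columns together makes the point of the passage visible: the curves nearly coincide at large supplies and separate widely as the supply approaches 10 million bales.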

First-order Standard Deviation Used as Standard Error of
Estimate. The dispersion about a curve of regression can be
measured in the same manner as the dispersion of cases about a
progression of means or a line of regression. The measure
generally used is the standard deviation and is called a "first-order
standard deviation" or a "standard error of estimate,"
because it is the standard deviation of the residuals about the
curves of regression by means of which estimates such as those
illustrated in Table 41 are made.

For the illustration in which cotton stocks, production, and
imports are correlated with cotton prices, three types of
regression lines have been fitted, as follows:

$$\log X_1' = a + b \log X_2 \qquad (A)$$

$$\frac{1}{X_1'} = a + bX_2 \qquad (B)$$

$$X_1' = a + b_1X_2 + b_2X_2^2 \qquad (C)$$

The standard error of estimate, being a standard deviation, is
defined as follows:

$$N\sigma_{1.2}^2 = \sum d^2 \qquad (17)$$

where each $d$ is defined, taking regression type (C), for example,
as

$$d = X_1 - X_1' = X_1 - a - b_1X_2 - b_2X_2^2 \qquad (18)$$

Hence, each $d^2$ will be as follows:

$$d^2 = d(X_1 - a - b_1X_2 - b_2X_2^2) = dX_1 - ad - b_1X_2d - b_2X_2^2d \qquad (19)$$

If all these $d^2$'s are added, the following result is obtained:

$$\sum d^2 = \sum X_1d - a\sum d - b_1\sum X_2d - b_2\sum X_2^2d \qquad (20)$$

By the least-squares condition, however, the last three terms
of Eq. (20) are equal to zero, for¹

$$\sum d = \sum (X_1 - a - b_1X_2 - b_2X_2^2) = 0$$
$$\sum X_2d = \sum X_2(X_1 - a - b_1X_2 - b_2X_2^2) = 0$$
$$\sum X_2^2d = \sum X_2^2(X_1 - a - b_1X_2 - b_2X_2^2) = 0$$

¹ See p. 384.
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Therefore, Eq. (20) reduces to the following: 


$$\sum d^2 = \sum X_1d = \sum X_1(X_1 - a - b_1X_2 - b_2X_2^2) = \sum X_1^2 - a\sum X_1 - b_1\sum X_1X_2 - b_2\sum X_1X_2^2 \qquad (21)$$

Accordingly, the formula for the square of the standard error of
estimate is as follows:

$$\sigma_{1.2}^2 = \frac{\sum X_1^2 - a\sum X_1 - b_1\sum X_1X_2 - b_2\sum X_1X_2^2}{N} \qquad (22)$$

If regression type (B) were taken, it can be shown similarly
that

$$\sigma_{1.2}^2 = \frac{\sum \left(\dfrac{1}{X_1}\right)^2 - a\sum \dfrac{1}{X_1} - b\sum \dfrac{X_2}{X_1}}{N} \qquad (23)$$

If the logarithmic regression equation is chosen, the standard
error of estimate is found by a similar procedure to be as follows:

$$\sigma_{1.2}^2 = \frac{\sum \log^2 X_1 - a\sum \log X_1 - b\sum \log X_1 \log X_2}{N}$$

The values necessary to calculate these standard errors of
estimate are available, respectively, in Tables 39, 38, and 37.

Calculation of standard error of estimate: For the logarithmic
regression¹

$$\sigma_{1.2}^2 = \frac{27.0945 - (1.5833)(22.5405) - (-1.5081)(5.7468)}{19} = \frac{27.0945 - 35.6884 + 8.6668}{19} = \frac{0.0729}{19} = 0.0038$$

$$\sigma_{1.2} = 0.06164$$

By using the ordinary formula for the standard deviation,

$$\sigma^2 = \frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2$$

(when the arbitrary origin is taken as zero), the necessary figures
are found in totals of the appropriate columns of Table 37, and
it is found that

$$\sigma_1^2 = 0.0187 \qquad \sigma_1 = 0.1368$$

¹ The scatter formula for the logarithmic regression could be calculated
by using the formula employed in the linear case, as follows:

$$\sigma_{1.2}^2 = \sigma_1^2\,(1 - r_{\log X_1 \log X_2}^2)$$

Since, however, the logarithmic $r$ has not been calculated, it is simpler to
use the formula based on the least-squares equations.

For the reciprocal regression:¹

$$\sigma_{1.2}^2 = \frac{0.09749 - (-0.03538)(1.29896) - (0.05564)(2.54365)}{19} = \frac{0.09749 + 0.04596 - 0.14153}{19} = \frac{0.00192}{19} = 0.000101$$

$$\sigma_{1.2} = 0.01$$

The standard deviation of $1/X_1$ is found by using the following
formula:

$$\sigma_{1/X_1}^2 = \frac{\sum \left(\dfrac{1}{X_1}\right)^2}{N} - \left(\frac{\sum \dfrac{1}{X_1}}{N}\right)^2$$

The necessary values are found in the sums of the appropriate
columns of Table 38.

$$\sigma_{1/X_1}^2 = 0.00869 \qquad \sigma_{1/X_1} = 0.0932$$

For the parabolic regression:

$$\sigma_{1.2}^2 = \frac{5{,}451.3758 - 68.2077(306.58) - (-43.327)(542.7359) - (7.9944)(994.4092)}{19}$$

$$= \frac{5{,}451.3758 - 20{,}911.1167 + 23{,}515.1183 - 7{,}949.7049}{19} = \frac{105.6725}{19} = 5.5617$$

$$\sigma_{1.2} = 2.3583$$

Using the ordinary formula for calculating the standard deviation
when zero is taken as the arbitrary origin,

$$\sigma_1^2 = 26.5508 \qquad \sigma_1 = 5.1528$$

¹ The scatter formula for the reciprocal regression could be calculated by
using the formula for the linear case, as follows:

$$\sigma_{1.2}^2 = \sigma_{1/X_1}^2\left(1 - r_{\frac{1}{X_1} X_2}^2\right)$$

Since, however, the reciprocal $r$ has not been calculated, it is simpler to use
the formula based on the least-squares equations.
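
The parabolic standard error of estimate follows from Eq. (22) once the sums and the fitted coefficients are known. The sketch below uses the figures that appear in the calculation above; the function and variable names are illustrative assumptions.

```python
import math


def parabolic_residual_variance(n, sum_y2, sum_y, sum_xy, sum_x2y, a, b1, b2):
    """Eq. (22): mean squared residual about y = a + b1*x + b2*x^2,
    computed from sums rather than from the individual residuals."""
    residual_ss = sum_y2 - a * sum_y - b1 * sum_xy - b2 * sum_x2y
    return residual_ss / n


residual_variance = parabolic_residual_variance(
    n=19,
    sum_y2=5451.3758,    # sum of X1^2
    sum_y=306.58,        # sum of X1
    sum_xy=542.7359,     # sum of X1*X2
    sum_x2y=994.4092,    # sum of X1*X2^2
    a=68.2077,
    b1=-43.327,
    b2=7.9944,
)
standard_error = math.sqrt(residual_variance)

# Total variance of X1 about its mean, with zero as the arbitrary origin.
total_variance = 5451.3758 / 19 - (306.58 / 19) ** 2

print(round(residual_variance, 4), round(standard_error, 4), round(total_variance, 4))
```

The residual sum is a small difference between large terms, so the coefficients must be carried to several decimals for the result to be trustworthy.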
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Table 42 is a summary of the estimates of cotton prices made 
above, together with ranges of plus and minus one standard 


Table 42.—Estimates and Ranges of Twice the Standard Error of 
Estimate for Cotton Prices Based on Three Regression Curves 
Estimates and ranges, logarithmic regression 


Estimated 
log of price 
log Ai 

Range 

logarithms 

Estimated 

price 

Xx 

- 

Range of 

price, antilogarithms 

log X\ + <ri .2 

log X\ — <ri .2 

0.98317 

1.04481 

0.92153 

9.62 

11.19 

8.35 

1.06690 

1.12854 

1.00526 i 

11.67 

13.45 

11.29 

1.16292 

1.22456 

1.10128 

14.55 

16.77 

12.63 

1.27547 

1.33711 

1.21383 

18.86 

21.73 

13.23 

1.46612 

1.52776 

1.40448 

29.25 

33.71 

25.38 

1.58330 

1.64494 

1.52166 

38.31 

44.15 

33.24 


Estimates and ranges, reciprocal regression 


Estimated 
reciprocal 
of price 

1 

Xi 

Range 

reciprocals 

Estimated 

price 

Xi 

Range of estimated 
price, converted from 
reciprocals 

1 , 

+ 0’l.2 

Ai 

1 

.Yl ■ • 

0.10372 

0.11372 

0.09372 

9.64 

8.79 

10.67 

0.08703 

0.09703 

0.07703 

11.49 

10.31 

12.98 

0.07034 

0.08034 

0.06034 

14.22 

12.45 

16.57 

0.05364 

0.06364 

0.04364 

18.64 

15.71 

22.91 

0.03695 

0.04695 

0.02695 

27.06 

21.30 

37.11 

0.02026 

0.03026 

0.01026 

49.36 

33.05 

97.96 


Estimates and ranges, parabolic regression 


Estimated 

price 

Xi 

Standard error of 
estimate 

Xi + <n.2 

Xl — 0-1.2 

9.85 

12.21 

7.49 

11.58 

13.94 

9.22 

14.75 

17.11 

12.39 

19.35 

21.71 

16.99 

25.39 

27.75 

23.03 

32.87 

35.23 

30.51 


error of estimate. In the cases of the logarithmic and reciprocal 
regressions these ranges are converted into original units of 
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the data in order to show their significance. The differences 
are notable. For the lower levels of price, the reciprocal
regression gives estimates with small standard errors of estimate,
but for the higher price levels the standard error of estimate is
smallest with the parabolic regression. Each of these methods of
calculating regression curves assumes that the variance in $X_1$
is the same for all subgroups of $X_1$ associated with varying
values of $X_2$. The logarithmic regression assumes that, when
converted into logarithms, the variance about the logarithmic
regression is equal at all points but that, when converted into
antilogarithms, it will be larger for the higher prices. The
reciprocal regression assumes equal variance about the curve in
terms of reciprocals but, when converted, the variance about the
higher prices is larger than the variance about the lower prices.
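
The conversion of the logarithmic and reciprocal ranges into
original units may be sketched in Python as follows. The function
names are illustrative assumptions, not part of the text. The inputs
are the third-row figures of Table 42, and the standard errors of
estimate (0.06164 in logarithms, 0.01000 in reciprocals) are read off
the table as the difference between each estimate and its upper limit.

```python
def log_range(log_estimate, std_error):
    """Return (price, lower, upper) from a log10 estimate and its
    range of plus and minus one standard error of estimate."""
    return (10 ** log_estimate,
            10 ** (log_estimate - std_error),
            10 ** (log_estimate + std_error))


def reciprocal_range(rec_estimate, std_error):
    """Return (price, lower, upper) from a reciprocal estimate.
    Adding the standard error to the reciprocal gives the LOWER price."""
    return (1 / rec_estimate,
            1 / (rec_estimate + std_error),
            1 / (rec_estimate - std_error))


log_price, log_low, log_high = log_range(1.16292, 0.06164)
rec_price, rec_low, rec_high = reciprocal_range(0.07034, 0.01000)

print(round(log_price, 2), round(log_low, 2), round(log_high, 2))
print(round(rec_price, 2), round(rec_low, 2), round(rec_high, 2))
```

In both cases a range that is symmetrical in the transformed units
becomes unsymmetrical in original units, the upper part of the range
being the wider.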

The question suggests itself: Which one of these three
assumptions about the character of variance about the curves of
regression best suits the data of the particular problem? This
question is answered by determining which of the regression curves
is the best fit for the data in question.

Correlation Index. For each of the curves of regression
calculated in the previous section, a corresponding index of
correlation will help to determine which of the regression curves
is the best fit for the data. The standard error of estimate
measures the divergence of the bivariates from the curve of
regression; the correlation index measures the goodness of fit
of the curve of regression. The indexes of correlation may be
calculated by using Eq. (2).

Calculation of Indexes of Correlation: For the logarithmic
regression:

$$I_{12}^2 = \frac{\sigma_1^2 - \sigma_{1.2}^2}{\sigma_1^2} = \frac{0.0187 - 0.0038}{0.0187} = \frac{0.0149}{0.0187} = 0.7968, \qquad I_{12} = 0.8926$$

For the reciprocal regression:

$$I_{12}^2 = \frac{0.00869 - 0.00010}{0.00869} = \frac{0.00859}{0.00869} = 0.9885, \qquad I_{12} = 0.9942$$

For the parabolic regression:

$$I_{12}^2 = \frac{26.5508 - 5.5617}{26.5508} = \frac{20.9891}{26.5508} = 0.7905, \qquad I_{12} = 0.8891$$

The high correlation index obtained for the reciprocal
regression appears to indicate that the cotton supply and price data
for the period 1900 to 1940 are correlated in a reciprocal manner.
It indicates that the sample data are fitted by the reciprocal curve
of regression better than by either the logarithmic curve or the
parabolic curve.
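
The three calculations can be repeated with a short Python sketch.
The variances are those quoted in the calculations above; the function
and variable names are assumptions made for the illustration.

```python
import math


def correlation_index(total_variance, scatter_variance):
    """Index of correlation: square root of the proportion of the
    total variance that is NOT left as scatter about the curve."""
    return math.sqrt((total_variance - scatter_variance) / total_variance)


# (total variance, variance about the curve of regression)
curves = {
    "logarithmic": (0.0187, 0.0038),
    "reciprocal": (0.00869, 0.00010),
    "parabolic": (26.5508, 5.5617),
}

indexes = {name: correlation_index(total, scatter)
           for name, (total, scatter) in curves.items()}
best = max(indexes, key=indexes.get)

for name, value in indexes.items():
    print(f"{name}: I = {value:.4f}")
print("highest index:", best)
```

Each pair of variances is expressed in the units in which its curve
was fitted, so that only the resulting indexes, which are pure
numbers, are compared with one another.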

It is to be noted that, in general, the use of the index of
correlation to show which curve is the best fit is valid only when
all curves have the same number of regression statistics. Here two
curves had two regression statistics and one had three. A curve
with a larger number of regression statistics will always give a
better fit than a similar curve with a smaller number of regression
statistics. Here, however, the parabola, which had three
regression statistics, gives a worse fit than either the logarithmic
or the reciprocal curve, each of which has only two regression
statistics.

The Index of Correlation and Analysis of Variance. As already
pointed out, in the cases of the logarithmic and reciprocal curves
of regression, the Pearsonian coefficient of correlation may be
calculated. When transformed into original units, this coefficient
of correlation becomes the index of correlation. In the problems
above illustrated, however, the correlation index was calculated
instead by using the general formula based upon the scatter
because the arithmetic involved in the latter method is simpler.
In logarithmic and reciprocal units, respectively, the coefficient
of correlation squared is, for these curves of regression, a
coefficient of proportional variance just as is the $r^2$ for simple
linear correlation problems.

For the parabolic curve of regression, the deviations from the
curve of regression may be described as

$$X_1 - X_1' = d \quad \text{and} \quad X_1' = X_1 - d$$

If these are added for the entire data, the result is

$$\sum X_1' = \sum X_1 - \sum d$$

and since $\sum d = 0$ it follows that

$$\sum X_1' = \sum X_1 = N\bar{X}_1$$

and hence the mean of $X_1'$ equals the mean of $X_1$.

Consequently, the sum of squares of $X_1'$, measured from its
mean, may be obtained as follows:

$$\sum {x_1'}^2 = \sum {X_1'}^2 - N\bar{X}_1^2 \tag{25}$$

In Eq. (25), $\sum {X_1'}^2$ may be evaluated as follows:

$$\sum {X_1'}^2 = \sum (X_1 - d)^2 = \sum X_1^2 - 2\sum X_1 d + \sum d^2$$

As shown above on page 391, $\sum X_1 d = \sum d^2$. Therefore,

$$\sum {X_1'}^2 = \sum X_1^2 - \sum d^2$$

and

$$\sum {x_1'}^2 = \sum X_1^2 - \sum d^2 - N\bar{X}_1^2 \tag{26}$$

However, it is true by definition that

$$\sum X_1^2 - N\bar{X}_1^2 = N\sigma_1^2 \quad \text{and} \quad \sum d^2 = N\sigma_{1.2}^2$$

Therefore, Eq. (26) reduces to the following:

$$\sum {x_1'}^2 = N\sigma_1^2 - N\sigma_{1.2}^2 \tag{27}$$

From Eq. (27), by dividing by $N$ and then by $\sigma_1^2$ and
transposing terms, it follows that

$$\frac{\sum {x_1'}^2}{N\sigma_1^2} = 1 - \frac{\sigma_{1.2}^2}{\sigma_1^2} \tag{28}$$

and from Eq. (28) it follows by definition of $I_{12}^2$ that

$$I_{12}^2 = \frac{\sum {x_1'}^2}{N\sigma_1^2} \tag{29}$$

Hence the square of the correlation index has the same significance
as the square of the linear coefficient of correlation; it measures
the proportion of the total variance accounted for by the assumed
type of curvilinear correlation.
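
The steps of this derivation can be checked numerically. The
following Python sketch fits a parabola by least squares to a small
set of hypothetical observations (they are not the cotton data) and
confirms that the deviations sum to zero and that the total variance
divides into the variance of the estimated values and the variance
about the curve. The helper `solve` and all variable names are
assumptions made for the illustration.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical data: X2 independent, X1 dependent.
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x1 = [30.0, 22.0, 17.0, 15.0, 12.0, 11.0, 10.5, 10.0]
n = len(x1)

# Normal equations for the parabola X1' = a + b*X2 + c*X2^2.
s = [sum(v ** k for v in x2) for k in range(5)]
t = [sum(y * v ** k for y, v in zip(x1, x2)) for k in range(3)]
a, b, c = solve([[s[0], s[1], s[2]],
                 [s[1], s[2], s[3]],
                 [s[2], s[3], s[4]]], t)

fitted = [a + b * v + c * v * v for v in x2]
d = [y - f for y, f in zip(x1, fitted)]          # deviations from the curve
mean_1 = sum(x1) / n

var_total = sum((y - mean_1) ** 2 for y in x1) / n
var_fitted = sum((f - mean_1) ** 2 for f in fitted) / n
var_scatter = sum(e * e for e in d) / n

index_sq_from_fitted = var_fitted / var_total        # Eq. (29)
index_sq_from_scatter = 1 - var_scatter / var_total  # Eq. (28)

print(sum(d))
print(index_sq_from_fitted, index_sq_from_scatter)
```

The two values of the squared index agree except for rounding in
the last decimal places, as Eqs. (28) and (29) require.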



CHAPTER XVI 


MULTIPLE AND PARTIAL CORRELATION 

To deal with the relationship between only two variables
the method of correlation so far discussed is useful, but in the
nonexperimental sciences it is frequently and indeed usually
more important to be able to deal with the association between
three or more variables. In the social sciences in particular,
variations in practically every factor are related to variations
in several other factors rather than in a single other factor. For
example, variations in the price of cotton are related not only to
changes in the production and consumption of cotton but also
to changes in the prices of substitutes for cotton such as rayon
and, in addition, to changes in the value of money. Again, the
consumption of a commodity such as gasoline may depend more
upon the number of automobiles in existence and upon the
number of miles of hard-surfaced roads available for use than
upon the price of gasoline. As a matter of fact, it is dependent
on all these factors and others too. In such cases it is essential
to have some method of "multiple correlation" and "partial
correlation."

Definitions of Terms. Multiple Correlation. Multiple
correlation is an extension to more than two variables of the methods
of simple correlation. Simple linear correlation provides a line
of regression from which an average value for the dependent
variable may be estimated if the value of the independent
variable is given. Multiple linear correlation provides a
"plane" of regression by means of which an average value for
the dependent variable may be estimated if the values of
two or more independent variables are given. The plane of
regression of the price of cotton on the price of rayon and on the
wholesale price level, for example, would permit the estimation
of the former from joint knowledge of the latter, instead of from
the price of rayon alone. Similarly, the plane of regression
of the second-semester English grade on the first-semester English
grade and on the verbal scholastic-aptitude test grade would
permit the estimation of the former from joint knowledge of the
latter, instead of from only the first-semester English grade.
The regression equation, accordingly, has two or more terms
to the right instead of one; its general form is as follows:

$$X_1' = a_{1.23\ldots} + b_{12.3\ldots}X_2 + b_{13.2\ldots}X_3 + \cdots$$

where $X_1$ is the dependent variable, $X_2$, $X_3$, etc., are the
independent variables, and $a$ and the $b$'s are estimated
parameters, or regression statistics, whose numerical values are
determined in any particular case by the method of least squares.
The numerical subscripts will be explained later. For the moment
it only need be noted that a plane of regression is but the
extension to more than two variables of the idea of a line of
regression.

In simple linear correlation, dispersion about the line of
regression of $X_1$ on $X_2$ serves as a measure of the accuracy of
any estimate of $X_1$ made from the line of regression. In multiple
correlation, dispersion about the plane of regression serves as a
measure of the accuracy of any estimate of the dependent variable
made by reference to the plane of regression. One of the essential
problems of multiple correlation is to calculate dispersion about
the plane of regression.

In simple correlation, a line of regression is merely a law of
relationship between one variable taken as a dependent variable
and another taken as an independent variable; it does not of
itself describe the degree of relationship or association that exists.
To measure the degree of linear association is the function of the
coefficient of correlation. Since the coefficient of correlation
measures the amount of linear association, it also serves as a
measure of the goodness of fit of the linear-regression equation
to the bivariate distribution and yields a measure of the general
degree of accuracy of estimates made by reference to the
regression equation. In multiple correlation, the coefficient of
multiple correlation serves the same general function. First, it
serves as a measure of the degree of association between one variable
taken as the dependent variable and a group of other variables
taken as the independent variables. Hence, it also serves as a
measure of the goodness of fit of the calculated plane of regression
and consequently as a measure of the general degree of accuracy
of estimates made by reference to the equation for the plane of
regression.

In simple linear correlation, relationships are completely
described by two lines of regression, one in which $X_1$ is taken
as the dependent variable and the other in which $X_2$ is the
dependent variable. In multiple correlation involving three
variables, there are three planes of regression. If four variables
are involved, there are four planes of regression, and so forth.
In general, there are as many planes of regression as there are
variables that may be taken as dependent variables, in short, as
many planes of regression as variables. In particular cases, the
intuitive sense of cause and effect may lead to the rejection of
some of these possible planes of regression as being without any
practical significance. They must always, however, be
considered as theoretical possibilities.

Where only two variables are considered, the coefficient of
correlation between $X_2$, taken as dependent, and $X_1$, taken as
independent, is the same as the coefficient of correlation between
$X_1$, taken as dependent, and $X_2$, taken as independent. The
measure of goodness of fit of the line of regression of $X_2$ on
$X_1$ is the same as the measure of the goodness of fit of the line
of regression of $X_1$ on $X_2$. This cannot be said of the various
multiple-correlation coefficients. The multiple-correlation
coefficient that measures the degree of association between $X_1$,
dependent, and $X_2$ and $X_3$, independent, as a group and that
also serves as a measure of the goodness of fit of the plane of
regression of $X_1$ on $X_2$ and $X_3$ is not the same as the
coefficient of multiple correlation that measures the degree of
association of $X_2$, dependent, with $X_1$ and $X_3$, independent,
taken as a group and that also measures the goodness of fit of the
plane of regression of $X_2$ on $X_1$ and $X_3$. Furthermore,
neither of these two coefficients is equal, except by mere chance,
to the coefficient of multiple correlation that measures the degree
of association of $X_3$, dependent, with $X_1$ and $X_2$,
independent, taken together and that also measures the goodness of
fit of the plane of regression of $X_3$ on $X_1$ and $X_2$. In
multiple correlation, there are as many different coefficients of
multiple correlation as there are planes of regression.
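
That the three multiple-correlation coefficients differ can be seen
from a small Python sketch. It relies on the standard three-variable
formula for the squared multiple-correlation coefficient in terms of
the simple correlation coefficients, a formula that this section has
not yet derived and that is taken here as an assumption. The values
of the simple coefficients are hypothetical.

```python
import math


def multiple_r(r_dep_a, r_dep_b, r_ab):
    """Multiple-correlation coefficient of a dependent variable on two
    independent variables a and b, from the three simple coefficients:
    r_dep_a, r_dep_b (dependent with each) and r_ab (between the two)."""
    r_squared = ((r_dep_a ** 2 + r_dep_b ** 2
                  - 2 * r_dep_a * r_dep_b * r_ab) / (1 - r_ab ** 2))
    return math.sqrt(r_squared)


# Hypothetical simple correlation coefficients.
r12, r13, r23 = 0.80, 0.50, 0.30

R1_23 = multiple_r(r12, r13, r23)   # X1 dependent; X2, X3 independent
R2_13 = multiple_r(r12, r23, r13)   # X2 dependent; X1, X3 independent
R3_12 = multiple_r(r13, r23, r12)   # X3 dependent; X1, X2 independent

print(round(R1_23, 4), round(R2_13, 4), round(R3_12, 4))
```

The same three simple coefficients thus yield three different
multiple-correlation coefficients, one for each plane of regression.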

Linear vs. Nonlinear Relationships. The simplest form of
correlation analysis rests on the assumption that the association
between the variables is of a linear type. In some cases, this
assumption does violence to the facts, the association being
clearly of a nonlinear form. Where a simple form of nonlinear
relationship exists between two variables, it has been found
possible to fit a curve of regression instead of a line of regression
and to calculate a correlation coefficient that measures the
goodness of fit of this curve. Whether such a simple curve can be
fitted or not, it is possible to calculate a measure of nonlinear
relationship, called the "correlation ratio," that depends on a
comparison of the variation about the means of the columns
(or rows) of the grouped data with the total variation in the
data.¹

Such devices as these can also be used when nonlinear
relationships exist among three or more variables. When the
nonlinear relationship takes a simple form, it is possible to fit
a curved plane or a surface of regression. A multiple-correlation
index $I_{1.23}$ can also be calculated to serve as a measure of the
goodness of fit of this surface of regression. Whether a simple
form of a curved surface can be fitted or not, it is always possible
to calculate a multiple-correlation ratio of the same sort as the
correlation ratio for only two variables. Similar nonlinear
relationships can also be carried over into the analysis of partial
correlation.

Partial Correlation. Partial correlation is concerned with a
concept resulting from the fact that more than two variables
are correlated; if only two variables are considered, there is no
place for partial correlation. Where there are three or more
variables, however, the question of the interrelationships between
the variables becomes a part of the analysis. How much of the
apparent association between two variables ($X_1$ and $X_2$) is due
to their common association with a third variable ($X_3$) and how
much to their direct connection or to some connection through
other variables independent of $X_3$? Would $X_1$ and $X_2$ continue
to vary together if $X_3$ were held constant? This is the new
problem that partial correlation attempts to solve. Fortunately,
the methods employed in its solution are the same fundamentally
as those involved in simple linear correlation.
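
The question may be illustrated by a short Python sketch. It uses
the standard formula for the partial-correlation coefficient of three
variables in terms of the simple coefficients; that formula is taken
here as an assumption, since it is not derived until later, and the
numerical values are hypothetical.

```python
import math


def partial_r(r12, r13, r23):
    """Partial correlation between X1 and X2 with X3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


# Hypothetical case: X1 and X2 are each closely associated with X3.
r12, r13, r23 = 0.60, 0.80, 0.70

r12_3 = partial_r(r12, r13, r23)
r21_3 = partial_r(r12, r23, r13)   # same pair, roles of X1 and X2 exchanged

print(round(r12_3, 4))
```

Here the simple coefficient of 0.60 between $X_1$ and $X_2$ falls
to less than 0.10 when $X_3$ is held constant, so that most of the
apparent association is traceable to the common association with
$X_3$.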

This chapter is primarily concerned with linear multiple and
partial correlation involving three variables. The notation
involved in multiple and partial correlation will first be
summarized. A brief discussion of a multivariate frequency
distribution, upon the basis of which any form of multiple or partial
analysis must be based, will follow. Ensuing sections of the
chapter will explain the fitting of planes of regression and will
derive formulas for finding the numerical values of the regression
statistics of any given plane fitted by the method of least squares.
Formulas for measuring dispersion and for calculating
multiple-correlation coefficients will also be derived. Partial
correlation will be explained in more detail, and methods of
calculating partial-correlation coefficients will be indicated. In
the next chapter the entire subject will be illustrated by an example.

¹ See Chap. XV.

Notation. It is the practice in multiple- and partial-correlation
analysis to let a symbol indicate the class to which a given
quantity belongs and to denote by subscripts the particular
number of the designated class. For example, if $X$ stands for
any variable measured in original units, $X_1$ indicates a particular
member of this group and its subscript distinguishes it from
$X_2$, $X_3$, etc., which are members of other groups. In a designated
problem, $X_1$ may be the price of cotton, $X_2$ the price of rayon,
and $X_3$ the general price level. Following is a summary of the
various symbols used in the subsequent analysis, in which special
attention should be directed to the subscripts:

$X_1, X_2, X_3$
    Variables measured in original units

$X_1', X_2', X_3'$
    The estimated values of these variables given by the three
    regression equations in which the variables are taken as
    dependent. The primes distinguish them from the actual values
    of $X_1$, $X_2$, and $X_3$

$x_1, x_2, x_3$
    Variables measured from their means as origins
    ($x_1 = X_1 - \bar{X}_1$, etc.)

$x_1', x_2', x_3'$
    The estimated values of $x_1$, $x_2$, and $x_3$ given by their
    regression equations and measured from the means of $X_1$,
    $X_2$, $X_3$ ($x_1' = X_1' - \bar{X}_1$, etc.)

$\bar{X}_1, \bar{X}_2, \bar{X}_3$
    Means of $X_1$, $X_2$, $X_3$

$\sigma_1, \sigma_2, \sigma_3$
    Standard deviations of $X_1$, $X_2$, $X_3$

$x_1/\sigma_1,\ x_2/\sigma_2,\ x_3/\sigma_3$
    Variables measured from their means as origins and expressed
    in terms of standard-deviation units

$x_1'/\sigma_1,\ x_2'/\sigma_2,\ x_3'/\sigma_3$
    $x_1'$, $x_2'$, $x_3'$ expressed in terms of the
    standard-deviation units of $X_1$, $X_2$, $X_3$

$$\begin{aligned}
X_1' &= a_{1.23} + b_{12.3}X_2 + b_{13.2}X_3 \\
X_2' &= a_{2.13} + b_{21.3}X_1 + b_{23.1}X_3 \\
X_3' &= a_{3.12} + b_{31.2}X_1 + b_{32.1}X_2
\end{aligned} \tag{1}$$

    These are the equations for the three planes of regression in
    which the variables are measured in terms of original units.
    The $a$'s and $b$'s are the regression statistics of the
    equations, of which the explanation follows:

$a_{1.23}$
    The constant term in the regression equation in which $X_1$ is
    taken as the dependent variable and $X_2$ and $X_3$ as the
    independent variables

$a_{2.13}$ and $a_{3.12}$
    These are the constant terms when $X_2$ and $X_3$,
    respectively, are the dependent variables. The subscript before
    the point refers to the dependent-variable number; the
    subscripts after the point refer to the independent variables.
    The order of subscripts after the point is immaterial; that is,
    $a_{2.13} = a_{2.31}$

$b_{12.3}$
    The coefficient of $X_2$ in the regression equation in which
    $X_1$ is taken as the dependent variable and $X_3$ is the other
    independent variable. The first number in the subscript
    indicates the dependent variable; the second number in the
    subscript indicates the variable of which the $b$ is a
    coefficient; the point followed by the other subscript indicates
    that a third variable is considered. Similarly, $b_{13.2}$ is
    the coefficient of $X_3$ in the same regression equation. It is
    to be noted that $b_{12.3} \neq b_{21.3}$

$b_{21.3}$
    The coefficient of $X_1$ in the regression equation in which
    $X_2$ is taken as the dependent variable and $X_3$ is the other
    independent variable; $b_{23.1}$ is the coefficient of $X_3$ in
    the same regression equation

$b_{32.1}$ and $b_{31.2}$
    These have a similar meaning for the third regression equation

$$\begin{aligned}
x_1' &= b_{12.3}x_2 + b_{13.2}x_3 \\
x_2' &= b_{21.3}x_1 + b_{23.1}x_3 \\
x_3' &= b_{31.2}x_1 + b_{32.1}x_2
\end{aligned} \tag{2}$$

    Equations (2) are another form of the three regression
    equations. Here the variables are expressed in terms of
    deviations from their respective means. In these equations
    there are no $a$'s, or constant terms, because the planes of
    regression all pass through the point given by the means of the
    three variables. The $b$'s are the same as those in Eqs. (1)

$$\begin{aligned}
\frac{x_1'}{\sigma_1} &= \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3} \\
\frac{x_2'}{\sigma_2} &= \beta_{21.3}\frac{x_1}{\sigma_1} + \beta_{23.1}\frac{x_3}{\sigma_3} \\
\frac{x_3'}{\sigma_3} &= \beta_{31.2}\frac{x_1}{\sigma_1} + \beta_{32.1}\frac{x_2}{\sigma_2}
\end{aligned} \tag{3}$$

    Equations (3) give a third form in which the three regression
    equations may be written. Here the variables represent
    deviations from their respective means expressed in
    standard-deviation units [the $x'$'s are expressed in terms of
    the standard deviations of the $x$'s ($\sigma_1$, $\sigma_2$,
    $\sigma_3$) instead of the standard deviations of the $x'$'s
    themselves]. The form is similar to
    $x_1'/\sigma_1 = r\,x_2/\sigma_2$ for two variables.¹

    In Eqs. (3), the $\beta$'s correspond to the $b$'s in Eqs. (1)
    and (2). As may be seen by comparing Eqs. (2) and (3), the
    $\beta$'s are related to the $b$'s in the following way:

$$\begin{aligned}
b_{12.3} &= \beta_{12.3}\frac{\sigma_1}{\sigma_2} &\qquad b_{13.2} &= \beta_{13.2}\frac{\sigma_1}{\sigma_3} \\
b_{21.3} &= \beta_{21.3}\frac{\sigma_2}{\sigma_1} &\qquad b_{23.1} &= \beta_{23.1}\frac{\sigma_2}{\sigma_3} \\
b_{31.2} &= \beta_{31.2}\frac{\sigma_3}{\sigma_1} &\qquad b_{32.1} &= \beta_{32.1}\frac{\sigma_3}{\sigma_2}
\end{aligned} \tag{4}$$

    If the symmetry of these equations is noted, they are easily
    remembered; for example, the subscripts of the $b$'s are the
    same as the subscripts of the $\beta$'s, and the order of the
    first two subscript numbers describes the subscript for sigma
    in numerator and denominator, respectively. It is to be noted
    that $\beta_{12.3}$ does not equal $\beta_{21.3}$, etc.

¹ See pp. 349-351.

$\sigma_{1.23}$
    The scatter about the plane of regression of $X_1$ on $X_2$
    and $X_3$

$\sigma_{2.13}$
    The scatter about the plane of regression of $X_2$ on $X_1$
    and $X_3$

$\sigma_{3.12}$
    The scatter about the plane of regression of $X_3$ on $X_1$
    and $X_2$

$R_{1.23}$
    The multiple-correlation coefficient between $X_1$ on the one
    hand and $X_2$ and $X_3$ on the other

$R_{2.13}$
    The multiple-correlation coefficient between $X_2$ on the one
    hand and $X_1$ and $X_3$ on the other

$R_{3.12}$
    The multiple-correlation coefficient between $X_3$ on the one
    hand and $X_1$ and $X_2$ on the other

$r_{12.3}$
    The partial-correlation coefficient between $X_1$ and $X_2$
    when $X_3$ is held constant. The position of the subscripts is
    more important than the noncapitalization of the $r$ in
    distinguishing it from the multiple-correlation coefficients.
    The subscript after the point indicates which variable is held
    constant. $r_{21.3} = r_{12.3}$

$r_{13.2}$
    The partial-correlation coefficient between $X_1$ and $X_3$
    when $X_2$ is held constant

$r_{23.1}$
    The partial-correlation coefficient between $X_2$ and $X_3$
    when $X_1$ is held constant

Study of the symmetry in the above system of notation will
make it easy to remember. With the exception of the notation
for partial-correlation coefficients, the order of subscripts before
the point is always significant; following the point it is always
immaterial.

MULTIVARIATE FREQUENCY DISTRIBUTION

The monovariate frequency distribution, it will be recalled,
is the basis for the determination of various measures describing
the central tendency and variation about the central tendency
of a single variable. The bivariate frequency distribution
(Chap. XIII) is the basis for the calculation of the lines of
regression and the simple correlation $r$ as well as for the
calculation of correlation ratios. In fact, the bivariate frequency
distribution contained all the information regarding the joint
variation or covariance of $X_1$ and $X_2$ and hence formed the basis
for the calculation of any measure or law of relationship between
these two variables, linear or otherwise. Similarly, the
multivariate frequency distribution contains all the information
about the covariance of $X_1$, $X_2$, $X_3$, etc., and it thus forms
the basis for the calculation of any measure or law of relationship
between the different variables, individually or in groups.

Fig. 117. A trivariate frequency distribution.

Figure 117 shows a trivariate frequency distribution in which
each variable is grouped into three class intervals. A small
number of class intervals is taken in order to simplify the
diagram; in any actual problem, the number of class intervals would,
of course, be larger.

The figure shows the frequency (the number written on the
floor of each cubical cell) with which $X_1$ falls within a given
class interval at the same time that $X_2$ falls within another
given class interval and $X_3$ falls within a third given class
interval. Accordingly, in Fig. 117, the frequency with which $X_1$
takes on values between 100 and 200 at the same time that $X_2$ takes
on values between 1 and 2 and $X_3$ takes on values between 5 and 10
is 10. This is the frequency of the joint occurrence of the specified
$X_1$, $X_2$, and $X_3$ values. The frequencies in other cells
represent the frequency of joint occurrence of other $X_1$, $X_2$,
and $X_3$ combinations.

If the frequencies of this trivariate frequency distribution are
projected upon any one of the three reference planes, that is, if
the frequencies are added from top to bottom, from left to right,
or from front to rear, a bivariate frequency distribution is
obtained for two of the three variables. For example, if the
frequencies are projected upon the $X_2X_3$ plane, the bivariate
frequency distribution for these two variables shown in Table 43
is obtained. In Table 43 the frequencies of the trivariate
frequency distribution in Fig. 117 are added from top to bottom.

Table 43

                    $X_2$
  $X_3$      0-1     1-2     2-3
   0-5        8       8       7
   5-10       5      23      13
  10-15       1      11      17

If the frequencies are projected upon the $X_1X_2$ plane, the
bivariate frequency distribution of $X_1$ and $X_2$ shown in Table 44
is found. To obtain the frequencies in Table 44, the frequencies
in the trivariate frequency distribution (Fig. 117) are added
from front to rear.

Table 44

                    $X_2$
  $X_1$      0-1     1-2     2-3
  200-300     1      12      16
  100-200     6      20      16
    0-100     7      10       5

Finally, if the frequencies are projected onto the $X_1X_3$ plane,
the bivariate frequency distribution of $X_1$ and $X_3$ shown in
Table 45 is obtained. The frequencies shown in Table 45 are
the sum from left to right of the frequencies in the trivariate
frequency distribution shown in Fig. 117.

Table 45

[Bivariate frequency distribution of $X_1$ (rows, 300 down to 0)
and $X_3$ (columns, 15 down to 0); cell frequencies not legible.]
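
The three projections can be sketched in Python. The cell counts
below are hypothetical and are not those of Fig. 117; the variable
names and the indexing convention (`freq[i][j][k]` for the $X_1$,
$X_2$, and $X_3$ classes) are likewise assumptions made for the
illustration.

```python
# freq[i][j][k]: number of cases with X1 in class i, X2 in class j,
# and X3 in class k. Hypothetical counts, three classes per variable.
freq = [
    [[4, 2, 0], [3, 5, 1], [1, 2, 2]],
    [[2, 3, 1], [2, 10, 4], [1, 5, 6]],
    [[0, 1, 1], [1, 4, 5], [0, 3, 8]],
]
classes = range(3)

# Projection on the X2X3 plane: add over the X1 classes.
# Rows are X3 classes, columns are X2 classes.
table_23 = [[sum(freq[i][j][k] for i in classes) for j in classes]
            for k in classes]

# Projection on the X1X2 plane: add over the X3 classes.
# Rows are X1 classes, columns are X2 classes.
table_12 = [[sum(freq[i][j][k] for k in classes) for j in classes]
            for i in classes]

# Projection on the X1X3 plane: add over the X2 classes.
# Rows are X1 classes, columns are X3 classes.
table_13 = [[sum(freq[i][j][k] for j in classes) for k in classes]
            for i in classes]

total = sum(sum(sum(row) for row in layer) for layer in freq)

# The X2 column totals must agree in the two tables that contain X2.
x2_totals_from_23 = [sum(row[j] for row in table_23) for j in classes]
x2_totals_from_12 = [sum(row[j] for row in table_12) for j in classes]

for name, table in (("X2X3", table_23), ("X1X2", table_12), ("X1X3", table_13)):
    print(name, table)
```

Each projection retains the total number of cases, and any two
projections that share a variable give the same totals for that
variable. This agreement is a convenient check on the addition.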

In the three bivariate frequency distributions shown in Tables
43 to 45, it is to be noted that $X_2$ and $X_3$ are positively
correlated, as are also $X_1$ and $X_2$ and $X_1$ and $X_3$. The given
trivariate frequency distribution (Fig. 117) is one in which all the
variables are positively correlated with each other. In this
case, as the values of $X_2$ and $X_3$ both increase, the mean value
of $X_1$ also tends to increase; in other words, the plane of
regression of $X_1$ on $X_2$ and $X_3$ would slope upward from the
origin in both the $X_2$ and the $X_3$ direction. Because of the
all-round positive correlation between the variables, the other
planes of regression would also slope upward from the origin in both
directions.

The net regression between two variables in a multivariate
distribution is measured by the $b$ statistic, and it is possible to
have a negative net regression $b_{12.3}$ although the Pearsonian
coefficient of correlation is positive, and vice versa. If $r_{12}$
is small compared with $r_{13}$ and $r_{23}$, the latter being either
both negative or both positive, the plane of regression of $X_1$ on
$X_2$ and $X_3$ may slope downward in the $X_2$ direction even if
$r_{12}$ is positive. The statistic $b_{12.3}$ is of the same sign as
$r_{12}$ so long as $r_{12} - r_{13}r_{23}$ is of the same sign as
$r_{12}$. If this condition is not fulfilled, that is, if
$r_{12} - r_{13}r_{23}$ and $r_{12}$ are of opposite sign, $b_{12.3}$
will be opposite in sign to $r_{12}$ and the plane will slope in the
opposite direction from that indicated by the sign of $r_{12}$, which,
when multiplied by the ratio $\sigma_1/\sigma_2$, describes the slope
of the line of regression in the bivariate distribution of $X_1$ and
$X_2$.* In the case where $r_{12}$ is positive but $b_{12.3}$ is
negative, the coefficient of partial correlation $r_{12.3}$ is
negative, agreeing with the sign of the $b$ statistic. For this
reason, the partial-correlation coefficient may be said to measure
the net correlation between the two variables.

* See pp. 349-351.
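
A reversal of sign of this kind may be illustrated in Python. The
sketch assumes the standard expression for the $\beta$ coefficient of
three variables in terms of the simple correlation coefficients, whose
numerator is the quantity $r_{12} - r_{13}r_{23}$ discussed above; the
correlation coefficients and standard deviations are hypothetical.

```python
def beta_12_3(r12, r13, r23):
    """Net regression of X1 on X2 in standard-deviation units,
    X3 being the other independent variable."""
    return (r12 - r13 * r23) / (1 - r23 ** 2)


# Hypothetical case: r12 is small compared with r13 and r23,
# both of which are positive.
r12, r13, r23 = 0.30, 0.80, 0.70
sigma_1, sigma_2 = 12.0, 3.0      # hypothetical standard deviations

b12 = r12 * sigma_1 / sigma_2                         # simple line of regression
b12_3 = beta_12_3(r12, r13, r23) * sigma_1 / sigma_2  # plane of regression

print(round(b12, 4), round(b12_3, 4))
```

With these figures $r_{12} - r_{13}r_{23} = 0.30 - 0.56$ is negative,
so that the plane slopes downward in the $X_2$ direction although the
simple line of regression of $X_1$ on $X_2$ slopes upward.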

If the net correlations between $X_1$ and $X_2$ and between $X_1$
and $X_3$ are both negative, the plane of regression of $X_1$ on
$X_2$ and $X_3$ slopes downward in both directions. In this instance,
the $b_{12.3}$ and the $b_{13.2}$ of the regression equation are both
negative. In other words, the mean value of $X_1$ would tend to
decrease with increases in the values of both $X_2$ and $X_3$. This
particular plane of regression would have an all-round negative
slope. If the net correlation between $X_1$ and $X_3$ is negative,
however, and that between $X_1$ and $X_2$ is positive, the plane of
regression of $X_1$ on $X_2$ and $X_3$ slopes upward in the $X_2$
direction, that is, the mean value of $X_1$ increases as $X_2$
increases; and the plane slopes downward in the $X_3$ direction, that
is, the mean value of $X_1$ declines as $X_3$ increases. In this
instance, $b_{12.3}$ is positive, and $b_{13.2}$ is negative. The
plane of regression shows a positive relationship in one direction
and a negative relationship in the other direction.

These are a few of the possible forms that a trivariate
frequency distribution might take. Others include nonlinear
relationships. For example, the mean value of $X_1$ might first
increase as $X_2$ increases and also as $X_3$ increases and then later
decrease as both these variables continued to increase, or $X_1$
might decline in the $X_2$ direction after a certain point but
continue to rise in the $X_3$ direction. For either of these
combinations, a curved plane or surface of regression would give a
better fit than a straight plane.

In order that there be all-round independence, that is to say,
absolutely no correlation whatsoever, either linear or nonlinear,
between any of the variables, the following conditions must exist:

1. The distribution of $X_1$ for any given $X_2$ and $X_3$ class
intervals, that is, the distribution of $X_1$ values for any given
vertical shaft, must be of the same form, it must have the same mean,
the same standard deviation, etc., even though it does not have
the same number of cases, as the distribution of $X_1$ values for
every other vertical shaft of the trivariate frequency distribution
(see Fig. 117).

2. The distribution of $X_2$ values for any given $X_1$ and $X_3$
class interval, that is, the distribution of $X_2$ values in any given
horizontal shaft parallel to the $X_2$-axis and perpendicular to the
$X_1X_3$ plane, must be of the same form as the distribution of $X_2$
values in every other horizontal shaft parallel to the $X_2$-axis
(see Fig. 117).

3. The distribution of $X_3$ values for any given $X_1$ and $X_2$
class interval, that is, the distribution of $X_3$ values in any given
horizontal shaft parallel to the $X_3$-axis and perpendicular to
the $X_1X_2$ plane, must be of the same form as the distribution of
$X_3$ values in every other shaft parallel to the $X_3$-axis and
perpendicular to the $X_1X_2$ plane (see Fig. 117).

A close study of a multivariate frequency distribution is
therefore always desirable before attempting to calculate any
measure of relationship. Since in some instances the net
correlation may be of opposite sign from simple linear correlation,
as illustrated above, the examination of separate bivariate
distributions for each pair of variables is not always a reliable
method. It is better to undertake a study of the multivariate
distribution. In a trivariate problem a diagram similar to
Fig. 117 could be set up, but for a large number of class
intervals it would be extremely difficult, if not impossible, to draw.
The multivariate distribution can be studied, however, by
selecting all the $X_1$ and $X_2$ variates associated with a given
range of $X_3$ variates, for example, in Fig. 117, all the $X_1$ and
$X_2$ variates associated with values of $X_3$ from 5 to 10. In this
manner, a series of frequency distributions of $X_1$ for varying
values of $X_2$ is obtained.

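Table 46. Values of $X_1$ and $X_2$ Associated with Values of
$X_3$ between 5 and 10

[Bivariate frequency table with $X_2$ classes from 0 to 3 as columns
and $X_1$ classes as rows; cell frequencies not legible.]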

Similar tables could be constructed showing the values of $X_1$
and $X_2$ associated with values of $X_3$ between 0 and 5 and with
values of $X_3$ between 10 and 15. In this manner, the net correlation
between $X_1$ and $X_2$ can be studied; and, by a similar
procedure, the net correlations between $X_1$ and $X_3$ and between
$X_2$ and $X_3$ can be examined. If such a study should reveal that
linear relationships prevail, the methods to be discussed in the
ensuing sections could be applied. If simple curvilinear relationships
are apparent, some curved plane might better be fitted
instead of a straight plane. In some instances, the latter could
be accomplished by using logarithms, reciprocals, or some other
transformation of the variables, to which linear functions could
be fitted; in other instances, it might be necessary to fit parabolic
functions to the original units.
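The selection described above can be sketched in a few lines of Python. The observations below are invented purely for illustration, and the choice of half-open class limits (5 up to but not including 10) is an assumption, since the text does not say how boundary values are assigned.

```python
def pairs_in_stratum(x1, x2, x3, low, high):
    """Return the (X1, X2) pairs whose X3 value lies in [low, high)."""
    return [(a, b) for a, b, c in zip(x1, x2, x3) if low <= c < high]

# Hypothetical observations, invented only to illustrate the selection.
x1 = [120, 80, 150, 60, 200, 90]
x2 = [1, 0, 2, 0, 3, 1]
x3 = [6, 2, 8, 12, 9, 4]

# One table per X3 class, in the manner of Table 46.
low_stratum = pairs_in_stratum(x1, x2, x3, 0, 5)
middle_stratum = pairs_in_stratum(x1, x2, x3, 5, 10)
high_stratum = pairs_in_stratum(x1, x2, x3, 10, 15)
```

Each stratum can then be tabulated or plotted as an ordinary bivariate distribution of $X_1$ and $X_2$.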

MULTIPLE LINEAR REGRESSION 

The $a$'s, $b$'s, and $\beta$'s of a linear plane of regression are
calculated in terms of given data or of quantities easily calculated
from the data. The $\beta$'s can be evaluated in terms of the simple
correlation coefficients, the $r$'s; knowledge of these therefore
permits the immediate calculation of the former. The $b$'s can
be computed readily from the $\beta$'s by multiplying by the proper
ratio of standard deviations [see Eqs. (4)]. Finally, the $a$'s can
be computed from the $b$'s and the means of the different variables.

The common method of evaluating the $\beta$'s is the method of
least squares. It was pointed out that for three variables there
are three planes of regression. Values for the regression statistics
in the regression equation of $X_1$ on $X_2$ and $X_3$ are derived
by minimizing the sum of the squares of the deviations of the
actual values of $X_1$ from those ($X_1'$) given by the plane of regression,
that is, by minimizing $\sum(X_1 - X_1')^2$. Similarly, values for
the regression statistics in the regression equation of $X_2$ on $X_1$
and $X_3$ are derived by minimizing the sum of the squares of the
deviations of $X_2$ from those ($X_2'$) given by the second plane of
regression, that is, by minimizing $\sum(X_2 - X_2')^2$. Finally, the
values of the regression statistics in the regression equation of
$X_3$ on $X_1$ and $X_2$ are derived by minimizing $\sum(X_3 - X_3')^2$. All
three planes of regression are thus fitted by the method of least
squares, but in each case the sum of the squares of a different
set of deviations is minimized.

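Using the third form of regression equation, the values of the
statistics for the plane of regression of $X_1$ on $X_2$ and $X_3$ are
derived as follows:

In the equation for the plane of regression,

$$\frac{x_1'}{\sigma_1} = \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}$$

the problem is to determine $\beta_{12.3}$ and $\beta_{13.2}$ such that

$$\sum\left(\frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}\right)^2 = \sum\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2 \tag{5}$$

is a minimum. Since $\sum(X_1 - X_1')^2$ is merely

$$\sum[(X_1 - \bar{X}_1) - (X_1' - \bar{X}_1)]^2 = \sum(x_1 - x_1')^2 = \sigma_1^2\sum\left(\frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}\right)^2$$

it follows that $\sum(X_1 - X_1')^2$ will be a minimum when

$$\sum\left(\frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}\right)^2$$

is a minimum, and hence the plane of regression derived by
minimizing the latter is the same as that derived by minimizing
the former.

If $\sum\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2$ is to be a minimum, the derivative
of this sum with respect to $\beta_{12.3}$ must equal zero and also its
derivative with respect to $\beta_{13.2}$ must equal zero. These conditions
are expressed in the following equations:

$$\sum\frac{x_2}{\sigma_2}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0$$

$$\sum\frac{x_3}{\sigma_3}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0 \tag{6}$$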

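If in these equations the indicated multiplication is carried out
and if each equation is divided by $N$, they become

$$\frac{\sum x_1x_2}{N\sigma_1\sigma_2} - \beta_{12.3}\frac{\sum x_2^2}{N\sigma_2^2} - \beta_{13.2}\frac{\sum x_3x_2}{N\sigma_3\sigma_2} = 0$$

$$\frac{\sum x_1x_3}{N\sigma_1\sigma_3} - \beta_{12.3}\frac{\sum x_2x_3}{N\sigma_2\sigma_3} - \beta_{13.2}\frac{\sum x_3^2}{N\sigma_3^2} = 0 \tag{7}$$

But it will be noted that

$$\frac{\sum x_1x_2}{N\sigma_1\sigma_2} = r_{12}, \quad \frac{\sum x_1x_3}{N\sigma_1\sigma_3} = r_{13}, \quad \frac{\sum x_3x_2}{N\sigma_3\sigma_2} = r_{23}, \quad \frac{\sum x_2^2}{N} = \sigma_2^2 \quad\text{and}\quad \frac{\sum x_3^2}{N} = \sigma_3^2.$$

Hence Eqs. (7) reduce to the following:

$$r_{12} - \beta_{12.3} - \beta_{13.2}r_{23} = 0$$

$$r_{13} - \beta_{12.3}r_{23} - \beta_{13.2} = 0 \tag{8}$$

When solved by the ordinary method of simultaneous equations,

$$\beta_{12.3} = \frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2} \qquad \beta_{13.2} = \frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2} \tag{9}$$

From Eqs. (9) it will be noted that, when $r_{23} = 0$, $\beta_{12.3} = r_{12}$
and $\beta_{13.2} = r_{13}$.*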
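As a numerical check on Eqs. (9), the short Python sketch below computes the two $\beta$'s from the three simple correlations and substitutes them back into the reduced normal equations. The correlation values are hypothetical, and the function name is an assumption introduced only for this illustration.

```python
def standardized_betas(r12, r13, r23):
    """Return (beta12_3, beta13_2) from the simple correlations, per Eqs. (9)."""
    denom = 1.0 - r23 ** 2
    if denom == 0.0:
        # r23 = +1 or -1: X2 and X3 are perfectly collinear, no unique plane.
        raise ValueError("r23 squared equals 1; the betas are undefined")
    beta12_3 = (r12 - r13 * r23) / denom
    beta13_2 = (r13 - r12 * r23) / denom
    return beta12_3, beta13_2

# Hypothetical correlations, chosen only for illustration.
r12, r13, r23 = 0.6, 0.5, 0.4
beta12_3, beta13_2 = standardized_betas(r12, r13, r23)

# Both left-hand sides of the reduced normal equations should vanish.
check_first = r12 - beta12_3 - beta13_2 * r23
check_second = r13 - beta12_3 * r23 - beta13_2
```

With `r23 = 0` the function returns the simple correlations unchanged, which is the special case noted in the text.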

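If the other planes of regression are put in the form

$$\frac{x_2'}{\sigma_2} = \beta_{21.3}\frac{x_1}{\sigma_1} + \beta_{23.1}\frac{x_3}{\sigma_3}$$

$$\frac{x_3'}{\sigma_3} = \beta_{31.2}\frac{x_1}{\sigma_1} + \beta_{32.1}\frac{x_2}{\sigma_2}$$

and the values of the $\beta$'s are determined in the same manner as
the values of $\beta_{12.3}$ and $\beta_{13.2}$ were determined, the following results
are obtained:

$$\beta_{21.3} = \frac{r_{12} - r_{13}r_{23}}{1 - r_{13}^2} \qquad \beta_{23.1} = \frac{r_{23} - r_{12}r_{13}}{1 - r_{13}^2}$$

$$\beta_{31.2} = \frac{r_{13} - r_{12}r_{23}}{1 - r_{12}^2} \qquad \beta_{32.1} = \frac{r_{23} - r_{12}r_{13}}{1 - r_{12}^2}$$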

* See pp. 417 and 421.

If the simple linear-correlation coefficients are known, therefore,
it is possible to obtain all the $\beta$'s that enter into the three
multiple-regression equations; and the regression equations in the
$\beta$ form are thus determined. The other forms of the regression
equations can be derived from the $\beta$ form by calculating the
$b$'s from the $\beta$'s, and the $a$'s from the $b$'s and the means. For
example, Eqs. (4), such as $b_{12.3} = \beta_{12.3}\frac{\sigma_1}{\sigma_2}$, will give the values of
the $b$'s. The regression equations in the form

$$x_1' = b_{12.3}x_2 + b_{13.2}x_3$$

are then determined. If, in this latter form, $X_1' - \bar{X}_1$ is substituted
for $x_1'$, $X_2 - \bar{X}_2$ for $x_2$, and $X_3 - \bar{X}_3$ for $x_3$, the equation
becomes

$$X_1' = \bar{X}_1 - b_{12.3}\bar{X}_2 - b_{13.2}\bar{X}_3 + b_{12.3}X_2 + b_{13.2}X_3$$

from which it may be seen that the value of $a_{1.23}$ is as follows:

$$a_{1.23} = \bar{X}_1 - b_{12.3}\bar{X}_2 - b_{13.2}\bar{X}_3 \tag{10}$$

Similarly, the value of $a$ for the other regression equations is
found to be as follows:

$$a_{2.13} = \bar{X}_2 - b_{21.3}\bar{X}_1 - b_{23.1}\bar{X}_3 \tag{10$'$}$$

$$a_{3.12} = \bar{X}_3 - b_{31.2}\bar{X}_1 - b_{32.1}\bar{X}_2 \tag{10$''$}$$

It is helpful in the use of these equations to remember the
symmetry in the notation, that is, the symmetry in the position
of the subscripts.
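The conversion from the $\beta$ form to original units can be sketched as follows. The means, standard deviations, and $\beta$ values are hypothetical figures supplied only so that the arithmetic can be followed; the function name is likewise an assumption.

```python
def plane_in_original_units(beta12_3, beta13_2, means, sds):
    """Convert the beta form of the X1 plane to a + b12.3*X2 + b13.2*X3.

    means and sds are (X1, X2, X3) triples of means and standard deviations.
    """
    m1, m2, m3 = means
    s1, s2, s3 = sds
    b12_3 = beta12_3 * s1 / s2           # Eqs. (4): beta times sigma1/sigma2
    b13_2 = beta13_2 * s1 / s3
    a1_23 = m1 - b12_3 * m2 - b13_2 * m3  # Eq. (10)
    return a1_23, b12_3, b13_2

# Hypothetical inputs, invented for illustration only.
means = (50.0, 10.0, 20.0)
sds = (8.0, 2.0, 4.0)
a1_23, b12_3, b13_2 = plane_in_original_units(0.5, 0.25, means, sds)

# The fitted plane passes through the point of means.
estimate_at_means = a1_23 + b12_3 * means[1] + b13_2 * means[2]
```

Because of Eq. (10), substituting the means of $X_2$ and $X_3$ always returns the mean of $X_1$, which gives a quick check on hand computation.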

Second-order Variances for Linear Plane of Regression. The
formulas derived below measure the dispersion of the individual
items about the plane of regression fitted by the method of
least squares. As in the simpler case of the line of regression, so
also for the plane of regression, the mathematical procedure consists
in finding the standard deviation of the deviations of actual
$X$ values from the estimated values ($X'$) given by the planes of
regression. For example, by definition, $\sigma_{1.23}^2 = \sum(X_1 - X_1')^2/N$.
The task reduces to one of evaluating such expressions in terms
of quantities already known, that is, the $r$'s and the $\beta$'s. This
can be done as follows:

Since it has been found easier to work with the variables when
they are converted into deviations from their respective means
and expressed in terms of their standard deviations, the formula
for $\sigma_{1.23}^2$ will first be put in that form. This can be done by
subtracting $\bar{X}_1$ from both $X_1$ and $X_1'$ and by multiplying both
numerator and denominator by $\sigma_1^2$, neither of which will affect
the value of the expression. Thus,

$$\sigma_{1.23}^2 = \frac{\sum(X_1 - X_1')^2}{N} = \frac{\sigma_1^2\sum[(X_1 - \bar{X}_1) - (X_1' - \bar{X}_1)]^2}{\sigma_1^2N} = \sigma_1^2\frac{\sum d^2}{N} \tag{11}$$

where

$$d = \frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1}$$
The problem is to evaluate $\sum d^2$.

By Eqs. (3), the third form of the regression equation of $X_1$ on
$X_2$ and $X_3$, it follows that

$$d = \frac{x_1}{\sigma_1} - \frac{x_1'}{\sigma_1} = \frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3} \tag{12}$$

Accordingly, for any given set of values of $X_1$, $X_2$, and $X_3$, there
corresponds a particular value for $d$, which is the deviation of
the actual value of $x_1$ from the value of $x_1'$ obtained by putting
the given values of $x_2$ and $x_3$ in the regression equation. There
are just as many $d$'s, therefore, as there are different sets of
values of $x_1$, $x_2$, and $x_3$. If any one of these $d$'s is squared,
Eq. (12) gives

$$d^2 = \frac{x_1}{\sigma_1}d - \beta_{12.3}\frac{x_2}{\sigma_2}d - \beta_{13.2}\frac{x_3}{\sigma_3}d \tag{13}$$

If all values of $d$ are squared and summed, the following
equation results:

$$\sum d^2 = \sum\frac{x_1d}{\sigma_1} - \beta_{12.3}\sum\frac{x_2d}{\sigma_2} - \beta_{13.2}\sum\frac{x_3d}{\sigma_3} \tag{14}$$

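But from Eqs. (6) and (12), it will be noted that

$$\sum\frac{x_2d}{\sigma_2} = \sum\frac{x_2}{\sigma_2}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0$$

and

$$\sum\frac{x_3d}{\sigma_3} = \sum\frac{x_3}{\sigma_3}\left(\frac{x_1}{\sigma_1} - \beta_{12.3}\frac{x_2}{\sigma_2} - \beta_{13.2}\frac{x_3}{\sigma_3}\right) = 0$$

Therefore,

$$\sum d^2 = \sum\frac{x_1d}{\sigma_1} \tag{15}$$

The evaluation, therefore, of $\sum\frac{x_1d}{\sigma_1}$ will solve the problem.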
This can be done as follows: 

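If each $d$, as shown in Eq. (12), is multiplied by the $x_1/\sigma_1$ to
which that $d$ belongs, the following result is obtained:

$$\frac{x_1d}{\sigma_1} = \frac{x_1^2}{\sigma_1^2} - \beta_{12.3}\frac{x_1x_2}{\sigma_1\sigma_2} - \beta_{13.2}\frac{x_1x_3}{\sigma_1\sigma_3} \tag{16}$$

Values of $\frac{x_1d}{\sigma_1}$ for all values of $d$ and $x_1$ sum up as follows:

$$\sum\frac{x_1d}{\sigma_1} = \frac{\sum x_1^2}{\sigma_1^2} - \beta_{12.3}\frac{\sum x_1x_2}{\sigma_1\sigma_2} - \beta_{13.2}\frac{\sum x_1x_3}{\sigma_1\sigma_3} \tag{17}$$

Hence, dividing by $N$,

$$\frac{\sum d^2}{N} = \frac{\sum x_1d}{N\sigma_1} = \frac{\sum x_1^2}{N\sigma_1^2} - \beta_{12.3}\frac{\sum x_1x_2}{N\sigma_1\sigma_2} - \beta_{13.2}\frac{\sum x_1x_3}{N\sigma_1\sigma_3}$$

But since

$$\frac{\sum x_1^2}{N} = \sigma_1^2, \qquad \frac{\sum x_1x_2}{N\sigma_1\sigma_2} = r_{12}, \qquad \frac{\sum x_1x_3}{N\sigma_1\sigma_3} = r_{13}$$

it follows that

$$\frac{\sum d^2}{N} = 1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13} \tag{18}$$

and, finally, from Eqs. (11) and (18),

$$\sigma_{1.23}^2 = \sigma_1^2\frac{\sum d^2}{N} = \sigma_1^2(1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13}) \tag{19}$$

This gives an easy method for evaluating $\sigma_{1.23}^2$ when the $r$'s
and $\beta$'s have been calculated. Similar formulas for evaluating
$\sigma_{2.13}^2$ and $\sigma_{3.12}^2$, the scatters about the other planes of regression,
are found to be as follows:¹

$$\sigma_{2.13}^2 = \sigma_2^2(1 - \beta_{21.3}r_{12} - \beta_{23.1}r_{23}) \tag{19$'$}$$

$$\sigma_{3.12}^2 = \sigma_3^2(1 - \beta_{31.2}r_{13} - \beta_{32.1}r_{23}) \tag{19$''$}$$

Note the symmetry of these three equations.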
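Equation (19) can be checked against a direct computation of the scatter about the fitted plane. The sketch below uses a small invented data set and only the Python standard library; divisors of $N$ (not $N - 1$) are used throughout, in keeping with the definitions in the text. All variable names are assumptions made for the illustration.

```python
import math

def moments(x1, x2, x3):
    """Means, standard deviations (divisor N), and simple r's of three series."""
    cols = (x1, x2, x3)
    n = len(x1)
    means = [sum(c) / n for c in cols]
    devs = [[v - m for v in c] for c, m in zip(cols, means)]
    sds = [math.sqrt(sum(d * d for d in dv) / n) for dv in devs]

    def r(i, j):
        cross = sum(a * b for a, b in zip(devs[i], devs[j]))
        return cross / (n * sds[i] * sds[j])

    return means, sds, (r(0, 1), r(0, 2), r(1, 2))

# Invented observations, for illustration only.
x1 = [12.0, 15.0, 11.0, 18.0, 20.0, 14.0, 17.0, 22.0]
x2 = [2.0, 3.0, 1.0, 4.0, 5.0, 2.0, 4.0, 6.0]
x3 = [7.0, 6.0, 9.0, 5.0, 6.0, 8.0, 4.0, 5.0]

means, sds, (r12, r13, r23) = moments(x1, x2, x3)
beta12_3 = (r12 - r13 * r23) / (1 - r23 ** 2)
beta13_2 = (r13 - r12 * r23) / (1 - r23 ** 2)

# Scatter about the plane by Eq. (19).
var_by_formula = sds[0] ** 2 * (1 - beta12_3 * r12 - beta13_2 * r13)

# Scatter about the plane computed directly from the deviations X1 - X1'.
b12_3 = beta12_3 * sds[0] / sds[1]
b13_2 = beta13_2 * sds[0] / sds[2]
a1_23 = means[0] - b12_3 * means[1] - b13_2 * means[2]
deviations = [y - (a1_23 + b12_3 * u + b13_2 * v) for y, u, v in zip(x1, x2, x3)]
var_direct = sum(e * e for e in deviations) / len(x1)
```

The two quantities `var_by_formula` and `var_direct` should agree to within rounding error, which is the content of Eq. (19).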

COEFFICIENT OF MULTIPLE CORRELATION 


The multiple-correlation coefficient measures the correlation
between the dependent variable and the two independent
variables taken together. For reasons previously indicated,²

$$R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2}$$

may be taken as a good measure of multiple correlation.

$R_{1.23}$ measures the degree of association between the $X_1$
variable and $X_2$ and $X_3$ taken jointly. It can also be looked
upon as a measure of the goodness of fit of the plane of regression
of $X_1$ on $X_2$ and $X_3$ to the set of $X_1$ values. For if the fit is
perfect, $\sigma_{1.23}^2$ will be zero and hence $R_{1.23}$ will equal 1. Similarly,
$R_{2.13}$ measures the degree of association between the $X_2$
variable and $X_1$ and $X_3$ taken jointly, and $R_{3.12}$ measures the
association between the $X_3$ variable and $X_1$ and $X_2$ taken jointly.
They also can be looked upon as measures of goodness of fit of
their respective planes of regression. It will be recalled that all
three of these multiple-correlation coefficients may have different
values.

¹ The dispersion $\sigma_{1.23}^2$ may also be calculated from the formulas:

$$\sigma_{1.23}^2 = \sigma_{1.2}^2(1 - r_{13.2}^2) \qquad \sigma_{1.23}^2 = \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)$$

This may be demonstrated as follows: From Eq. (23$'$),

$$r_{13.2}^2 = \frac{(r_{13} - r_{12}r_{23})^2}{(1 - r_{12}^2)(1 - r_{23}^2)}$$

and

$$1 - r_{13.2}^2 = \frac{(1 - r_{12}^2)(1 - r_{23}^2) - (r_{13} - r_{12}r_{23})^2}{(1 - r_{12}^2)(1 - r_{23}^2)}$$

This gives

$$(1 - r_{12}^2)(1 - r_{13.2}^2) = \frac{1 - r_{12}^2 - r_{23}^2 + r_{12}^2r_{23}^2 - r_{13}^2 - r_{12}^2r_{23}^2 + 2r_{13}r_{12}r_{23}}{1 - r_{23}^2}$$

$$= \frac{(1 - r_{23}^2) - (r_{12}^2 + r_{13}^2 - 2r_{13}r_{12}r_{23})}{1 - r_{23}^2}$$

$$= 1 - r_{12}\frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2} - r_{13}\frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2}$$

Equation (9), however, shows that the two fractions on the right are equal
respectively to $\beta_{12.3}$ and $\beta_{13.2}$. Hence

$$(1 - r_{12}^2)(1 - r_{13.2}^2) = 1 - r_{12}\beta_{12.3} - r_{13}\beta_{13.2}$$

or on making use of Eq. (19),

$$\sigma_{1.23}^2 = \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)$$

In Chap. XIII (p. 321) it was shown that $\sigma_{1.2}^2 = \sigma_1^2(1 - r_{12}^2)$. Hence the
last equation may be written

$$\sigma_{1.23}^2 = \sigma_{1.2}^2(1 - r_{13.2}^2)$$

Thus both of the original formulas are derived from previously demonstrated
relationships. Similar formulas hold for $\sigma_{2.13}^2$ and $\sigma_{3.12}^2$. These are

$$\sigma_{2.13}^2 = \sigma_{2.3}^2(1 - r_{12.3}^2) = \sigma_2^2(1 - r_{23}^2)(1 - r_{12.3}^2)$$

and

$$\sigma_{3.12}^2 = \sigma_{3.2}^2(1 - r_{13.2}^2) = \sigma_3^2(1 - r_{23}^2)(1 - r_{13.2}^2)$$

² See pp. 351-353 and 305-368.

The multiple coefficient of correlation $R_{1.23}$ is always numerically larger
than or at least equal to $r_{12}$ and $r_{13}$; for it stands to reason that $X_1$
can be estimated better (or at least no more poorly) from two
variables $X_2$ and $X_3$ than from $X_2$ alone or $X_3$ alone. Similarly,
$R_{2.13}$ is greater than, or at least equal to, $r_{12}$ and $r_{23}$; and $R_{3.12}$ is
greater than, or at least equal to, $r_{13}$ and $r_{23}$. Furthermore,
$R_{1.23}^2$ is equal to the sum of $r_{12}^2$ and $r_{13}^2$ if $X_2$ and $X_3$ are independent
of each other; for, by Eq. (19), it follows that

$$R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2} = 1 - \frac{\sigma_1^2(1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13})}{\sigma_1^2} = \beta_{12.3}r_{12} + \beta_{13.2}r_{13} \tag{20}$$

If $X_2$ and $X_3$ are independent of each other, $r_{23} = 0$; and, by
Eqs. (9), $\beta_{12.3} = r_{12}$ and $\beta_{13.2} = r_{13}$. Accordingly, if $r_{23} = 0$,

$$R_{1.23}^2 = r_{12}^2 + r_{13}^2 \tag{21}$$

Similarly, if $r_{13} = 0$,

$$R_{2.13}^2 = r_{12}^2 + r_{23}^2 \tag{21$'$}$$

and if $r_{12} = 0$,

$$R_{3.12}^2 = r_{13}^2 + r_{23}^2 \tag{21$''$}$$

Consequently, by adding to the regression equation a second
variable that is independent of the first, the proportion of the
variance of the dependent variable that is accounted for is
increased by the square of the correlation between the dependent
variable and the newly added variable.
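A brief Python sketch of Eq. (20) and of the special case in Eq. (21) follows. The correlations are hypothetical, and the function name is an assumption made for the example.

```python
def multiple_r_squared(r12, r13, r23):
    """R^2 of X1 on X2 and X3 from the simple correlations, per Eq. (20)."""
    denom = 1.0 - r23 ** 2
    beta12_3 = (r12 - r13 * r23) / denom
    beta13_2 = (r13 - r12 * r23) / denom
    return beta12_3 * r12 + beta13_2 * r13

# Hypothetical correlations with X2 and X3 themselves correlated.
r_squared_correlated = multiple_r_squared(0.6, 0.5, 0.4)

# The same r12 and r13, but with X2 and X3 uncorrelated (r23 = 0):
# Eq. (21) says the result is r12^2 + r13^2.
r_squared_independent = multiple_r_squared(0.6, 0.5, 0.0)
```

In the first case $R_{1.23}^2$ is no smaller than either $r_{12}^2$ or $r_{13}^2$; in the second it equals their sum.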

It should be noted that only in special instances can a definite
sign be given the multiple-correlation coefficient, although it is
usually assumed to be inherently positive. For, as was indicated
above, it may happen that the plane of regression to which a
given multiple-correlation coefficient pertains may slope upward
in one direction and downward in another direction, indicating
a positive relationship between the dependent variable and one
independent variable and a negative relationship between the
dependent variable and the other independent variable. In
such an instance, the correlation between the dependent variable
and the two independent variables taken jointly, that is, the
multiple correlation, cannot be said to be either positive or
negative. For such a multiple-correlation coefficient, no sign is
attached. It is only when the dependent variable is positively
or negatively correlated with each and every one of the independent
variables that the multiple-correlation coefficient can be
given a positive or negative sign.

COEFFICIENT OF PARTIAL CORRELATION 

In the preceding sections of this chapter the discussion has
centered on the problem of estimating the value of one variable
from one or more other variables by means of a regression equation.
In connection with this problem, a coefficient measuring
the degree of association between the dependent variable and the
independent variables as a group was evaluated to show the
accuracy with which such estimates can be made. This
coefficient is a measure of the goodness of fit of the plane of
regression.

When there are interrelationships among three or more
variables, another problem appears. It often happens that an
apparent relationship between two variables is in reality the
result of their individual connection with a third variable that
commonly affects them both. For example, it may be that the
correlation between the price of cotton and the price of rayon
is due largely to the correlation of each of them with the index
of wholesale prices. In other words, the concomitant movements
in the prices of cotton and rayon may be due fundamentally
not to any direct relationship between these two
competing commodities, but primarily to their common tendency
to rise and fall with the general price level; they may be
joint effects of a common cause. Similarly, the concomitant
variations in first- and second-semester English grades of
freshmen in a woman's college may be basically accounted for by
their respective relationships to the grades attained by the same
freshmen in verbal scholastic-aptitude tests or to their school
records.

The statistical device for discovering how much correlation
there is between one variable and another variable when a third
variable or a number of other variables are "held constant" is
the method of "partial correlation." The correlation between
the freshmen grades in second-semester English, $X_1$, and the
freshmen grades in first-semester English, $X_2$, when the grades
of the respective freshmen in verbal scholastic-aptitude tests, $X_3$,
are held constant is the partial correlation between $X_1$ and $X_2$.
Such a partial correlation coefficient will show how much connection
there is between grades in first- and second-semester
English independent of their common connection with grades
in verbal scholastic-aptitude tests. The coefficient of partial
correlation, indicated in this instance as $r_{12.3}$, will measure the
degree of this independent association.¹

A variable is, of course, not held constant in any physical
sense. It is not possible in any way ex post facto to change the
fact that a Mount Holyoke freshman, who had grades of 160
in first-semester English and 160 in second-semester English,
had also a grade of 437 in her verbal scholastic-aptitude test;
nor is it possible to change the fact that another Mount Holyoke
freshman, who had grades of 120 in first-semester English and
160 in second-semester English, had also a grade of 384 in her
verbal scholastic-aptitude test. The ideal of holding constant
one of the three variables is wholly a statistical idea. It consists
in eliminating from each of the two variables between which
the partial correlation is sought the effect of the third variable.
More specifically, the line of regression of $X_1$ on $X_3$ is found, and
the deviations of the actual values of $X_1$ from those given by the
line of regression $X_1' = a_{1.3} + b_{13}X_3$ are determined. These
deviations from the line of regression represent the variation in
$X_1$ that is left over after the linear effect of $X_3$ is eliminated.
Similarly, the line of regression of $X_2$ on $X_3$ is computed, and
the deviations of the actual values of $X_2$ from those given by the
line of regression $X_2' = a_{2.3} + b_{23}X_3$ are determined. These
deviations from the line of regression represent the variation in
$X_2$ that is left over after the linear effect of $X_3$ is eliminated.
When these residual deviations in $X_1$ and $X_2$ are correlated, the
result is the partial-correlation coefficient between $X_1$ and $X_2$
when $X_3$ is held constant, because the effect of $X_3$ upon each of
them has been eliminated.

¹ The position of the point in the subscripts of $r_{12.3}$, rather than the fact
that it is a smaller letter, distinguishes it from $R_{1.23}$. In the latter, the
point comes after the first digit, setting off the two independent variables
$X_3$ and $X_2$, jointly associated with the dependent variable $X_1$. In the
coefficient of partial correlation, the point sets off the variable that is held
constant coming immediately after the pair that are correlated, $X_1$ and
$X_2$. Thus, in $r_{12.3}$, $X_3$ is held constant while $X_1$ and $X_2$ are correlated;
in $r_{12.345}$, $X_3$, $X_4$, and $X_5$ are held constant while $X_1$ and $X_2$ are correlated.
The symbol $R_{1.2345}$, by the position of the point, indicates a multiple-correlation
coefficient between $X_1$, dependent, and $X_2$, $X_3$, $X_4$, and $X_5$ taken
jointly as the independent variables.
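The procedure just described, correlating the deviations left over from two simple lines of regression, can be sketched directly. The data are invented for illustration and do not come from the text; the helper names are assumptions, and only the standard library is used.

```python
import math

def residuals_on(y, x):
    """Deviations of y from the least-squares line of regression of y on x."""
    n = len(y)
    mean_y, mean_x = sum(y) / n, sum(x) / n
    sxy = sum((a - mean_y) * (b - mean_x) for a, b in zip(y, x))
    sxx = sum((b - mean_x) ** 2 for b in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [a - (intercept + slope * b) for a, b in zip(y, x)]

def simple_r(u, v):
    """Ordinary product-moment correlation coefficient of u and v."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def partial_r_by_residuals(x1, x2, x3):
    """Correlate what is left of X1 and X2 after removing the linear effect of X3."""
    return simple_r(residuals_on(x1, x3), residuals_on(x2, x3))

# Invented observations, for illustration only.
x1 = [12.0, 15.0, 11.0, 18.0, 20.0, 14.0, 17.0, 22.0]
x2 = [2.0, 3.0, 1.0, 4.0, 5.0, 2.0, 4.0, 6.0]
x3 = [7.0, 6.0, 9.0, 5.0, 6.0, 8.0, 4.0, 5.0]

r12_3 = partial_r_by_residuals(x1, x2, x3)
```

This is the long route; it serves mainly to make concrete what "holding $X_3$ constant" means in computational terms.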

To calculate a partial coefficient of correlation, the extended
calculations involved in computing two lines of regression and
measuring the deviations of the actual values from them are not
necessary. The coefficient of partial correlation can be algebraically
evaluated in a formula that makes it possible to compute
it from the coefficients of simple linear correlation, as follows:

The deviation of $X_1$ from the line of regression of $X_1$ on $X_3$
may be written as $x_1 - x_1'$ or $x_1 - r_{13}\frac{\sigma_1}{\sigma_3}x_3$, where the $x$'s are
measured from their respective means. Similarly, the deviation
of $X_2$ from the line of regression of $X_2$ on $X_3$ may be written as
$x_2 - x_2'$ or $x_2 - r_{23}\frac{\sigma_2}{\sigma_3}x_3$. The standard deviations of these
deviations from the lines of regression have already been determined
to be $\sigma_{1.3}$ and $\sigma_{2.3}$, respectively. In accordance with the
ordinary formula for a simple correlation coefficient, that is,
$r = \sum x_1x_2/N\sigma_1\sigma_2$, the partial-correlation coefficient between $X_1$
and $X_2$, when $X_3$ is held constant, is, by definition,

$$r_{12.3} = \frac{\sum\left(x_1 - r_{13}\frac{\sigma_1}{\sigma_3}x_3\right)\left(x_2 - r_{23}\frac{\sigma_2}{\sigma_3}x_3\right)}{N\sigma_{1.3}\sigma_{2.3}} \tag{22}$$

If the numerator is expanded and the values for $\sigma_{1.3}$ and $\sigma_{2.3}$
are substituted in the denominator, this becomes

$$r_{12.3} = \frac{\sum\left(x_1x_2 - r_{23}\frac{\sigma_2}{\sigma_3}x_1x_3 - r_{13}\frac{\sigma_1}{\sigma_3}x_2x_3 + r_{13}r_{23}\frac{\sigma_1\sigma_2}{\sigma_3^2}x_3^2\right)}{N\sigma_1\sqrt{1 - r_{13}^2}\,\sigma_2\sqrt{1 - r_{23}^2}}$$

Upon transferring the divisor $N\sigma_1\sigma_2$ from the denominator to
each term of the numerator, this becomes

$$r_{12.3} = \frac{\dfrac{\sum x_1x_2}{N\sigma_1\sigma_2} - r_{23}\dfrac{\sum x_1x_3}{N\sigma_1\sigma_3} - r_{13}\dfrac{\sum x_2x_3}{N\sigma_2\sigma_3} + r_{13}r_{23}\dfrac{\sum x_3^2}{N\sigma_3^2}}{\sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}}$$

in which $r_{12}$, $r_{13}$, $r_{23}$ can be substituted for their respective equivalent
values, making the formula appear as follows:

$$r_{12.3} = \frac{r_{12} - r_{23}r_{13} - r_{13}r_{23} + r_{13}r_{23}}{\sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}}$$

which reduces to

$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{1 - r_{13}^2}\sqrt{1 - r_{23}^2}} \tag{23}$$

Similar formulas for the partial correlation between $X_1$ and
$X_3$ when $X_2$ is held constant and the partial correlation between
$X_2$ and $X_3$ when $X_1$ is held constant are as follows:¹

$$r_{13.2} = \frac{r_{13} - r_{12}r_{23}}{\sqrt{1 - r_{12}^2}\sqrt{1 - r_{23}^2}} \tag{23$'$}$$

$$r_{23.1} = \frac{r_{23} - r_{12}r_{13}}{\sqrt{1 - r_{12}^2}\sqrt{1 - r_{13}^2}} \tag{23$''$}$$

From Eq. (23) it can be seen that if $X_1$ and $X_2$ are both uncorrelated
with $X_3$, that is, if $r_{13}$ and $r_{23}$ are zero, then $r_{12.3} = r_{12}$.
Similarly, if $r_{12}$ and $r_{23}$ are zero, $r_{13.2} = r_{13}$; and if $r_{12}$ and $r_{13}$
are zero, $r_{23.1} = r_{23}$.

¹ For the coefficient of partial correlation, the order of the numbers in the
subscripts either before or after the point is a matter of indifference; that
is, $r_{12.3} = r_{21.3}$ and $r_{12.345} = r_{21.435}$, etc. It will be remembered that this is
not true with respect to the order of the numbers in the subscripts before
the point in the $b$'s and the $\beta$'s; that is, $b_{12.3} \neq b_{21.3}$, $\beta_{12.3} \neq \beta_{21.3}$.

Any one of the following formulas, which can be readily
derived algebraically [see Eq. (9)], may be used in place of, or as
a check on, Eq. (23):


$$r_{12.3} = \sqrt{\beta_{12.3}\beta_{21.3}} \tag{24}$$

$$r_{12.3} = \beta_{12.3}\sqrt{\frac{1 - r_{23}^2}{1 - r_{13}^2}} \tag{25}$$

$$r_{12.3} = \beta_{12.3}\frac{\sigma_1\sigma_{2.3}}{\sigma_2\sigma_{1.3}} \tag{26}$$

$$r_{12.3} = b_{12.3}\frac{\sigma_{2.3}}{\sigma_{1.3}} \tag{27}$$

Thus the partial-correlation coefficient can be calculated directly
from the $\beta$'s or from the $b$'s and the dispersion formulas for the
simple lines of regression, as well as from the simple $r$'s. The
equations for the other coefficients of partial correlation are
symmetrical with Eqs. (24) to (27).

The partial coefficients of correlation illustrated above are
called correlation coefficients of the "first order," while the simple
coefficients $r_{12}$, $r_{23}$, etc., are called "zero-order" correlation
coefficients. If there are more than three variables involved so
that a partial coefficient of correlation, $r_{12.34}$, for example, is
found, it is called a "second-order" coefficient of correlation;
similarly, $r_{12.345}$ is a "third-order" coefficient of correlation, etc.
This classification is helpful in distinguishing different sets of
correlation coefficients. The same terminology may be conveniently
carried over to the other statistics in a correlation
problem. Thus, $\sigma_1$ is a zero-order standard deviation, $\sigma_{1.2}$ is a
first-order standard deviation, etc.; $b_{12.3}$ is a first-order regression
statistic, $b_{12.34}$ is a second-order regression statistic; etc.
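The agreement among Eqs. (23), (24), and (25) can be verified numerically. The sketch below uses hypothetical correlations; the sign given to the square root in Eq. (24) is taken from $\beta_{12.3}$, which is an assumption of this illustration since the text does not state the sign convention.

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation r12.3 from the simple r's, per Eq. (23)."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical zero-order correlations, for illustration only.
r12, r13, r23 = 0.6, 0.5, 0.4

beta12_3 = (r12 - r13 * r23) / (1 - r23 ** 2)   # Eqs. (9)
beta21_3 = (r12 - r13 * r23) / (1 - r13 ** 2)   # companion formula for X2 on X1, X3

via_eq_23 = partial_r(r12, r13, r23)
# Eq. (24): geometric mean of the two betas, carrying the sign of beta12.3.
via_eq_24 = math.copysign(math.sqrt(beta12_3 * beta21_3), beta12_3)
# Eq. (25): beta12.3 rescaled by the ratio of the residual dispersions.
via_eq_25 = beta12_3 * math.sqrt((1 - r23 ** 2) / (1 - r13 ** 2))
```

Any disagreement among the three values beyond rounding error would point to an arithmetic slip in the $\beta$'s or the $r$'s.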

ANALYSIS OF VARIANCE IN MULTIPLE CORRELATION 

When a plane of regression, for example, $X_1$ on $X_2$ and $X_3$,
is fitted to a trivariate frequency distribution by the method of
least squares, variation in $X_1$ may be viewed as made up of a
part that is due to its linear association with $X_2$, a second part
that is due to its linear association with $X_3$, and a third part
that is due to association with factors independent of both $X_2$
and $X_3$. For the least-squares Equations (6) show that
$\sum x_2d = 0$ and $\sum x_3d = 0$, which means that neither $X_2$ nor $X_3$
is linearly correlated with deviations from the plane of regression.
In the case of a normal trivariate frequency distribution,
the independent variables are not correlated in any way with
the deviations from the plane of regression. Owing to the lack
of correlation, the variance in the dependent variable is equal to
the variance of the values given by the plane of regression plus
the variance of the deviations from the plane. This may be
shown as follows:


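$$x_1 = x_1' + (x_1 - x_1')$$

and

$$x_1^2 = (x_1')^2 + 2(x_1 - x_1')x_1' + (x_1 - x_1')^2 \tag{28}$$

If all the deviations squared, like $x_1^2$, described in Eq. (28) are
added, the following result is obtained:

$$\sum x_1^2 = \sum(x_1')^2 + \sum(x_1 - x_1')^2 + 2\sum(x_1 - x_1')x_1'$$

or, by substituting $x_1' = \left(\beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}\right)\sigma_1$ for the last $x_1'$,

$$\sum x_1^2 = \sum(x_1')^2 + \sum(x_1 - x_1')^2 + 2\sigma_1\beta_{12.3}\sum(x_1 - x_1')\frac{x_2}{\sigma_2} + 2\sigma_1\beta_{13.2}\sum(x_1 - x_1')\frac{x_3}{\sigma_3}$$

But, as just stated, the deviations $x_1 - x_1'$ are not linearly correlated
with $x_2$ and $x_3$, so that the cross-product terms are zero.
Therefore,

$$\sum x_1^2 = \sum(x_1')^2 + \sum(x_1 - x_1')^2$$

If each term is divided by $N$, it is found that

$$\sigma_1^2 = \sigma_{x_1'}^2 + \sigma_{1.23}^2 \tag{29}$$

In Eq. (29), $\sigma_{x_1'}^2$ may be further evaluated, as follows:¹

$$\sum(x_1')^2 = \sigma_1^2\sum\left(\beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}\right)^2 = \sigma_1^2\left(\beta_{12.3}^2\frac{\sum x_2^2}{\sigma_2^2} + \beta_{13.2}^2\frac{\sum x_3^2}{\sigma_3^2} + 2\beta_{12.3}\beta_{13.2}\frac{\sum x_2x_3}{\sigma_2\sigma_3}\right)$$

If the above is divided by $N$ and $\sigma_{x_1'}^2$, $\sigma_2^2$, $\sigma_3^2$, and $r_{23}$ are substituted
for equivalent values, the expression becomes

$$\sigma_{x_1'}^2 = \sigma_1^2(\beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}) \tag{30}$$

By substituting this value for $\sigma_{x_1'}^2$ in Eq. (29), it is found that

$$\sigma_1^2 = \sigma_1^2(\beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}) + \sigma_{1.23}^2$$

or

$$1 = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} + \frac{\sigma_{1.23}^2}{\sigma_1^2} \tag{31}$$

¹ See p. 411.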

From the manner in which Eq. (31) was derived, it is known
that the terms on the right side each represent a percentage of
the total variance of $X_1$. This may be interpreted in the following
manner:

Fig. 118.—Illustration of coefficients of direct determination, meaning of $2r$-$\beta$
cross-product term and the residual variance. [Diagram not reproduced; it shows $X_2$, $X_3$, and "other factors than $X_2$ and $X_3$" acting on $X_1$.]

Coefficients of Direct Determination. The first term of the
right side of Eq. (31), $\beta_{12.3}^2$, may be interpreted as the percentage
of the total variation in $X_1$ that is due to its direct association
with $X_2$. It has consequently been called the "coefficient of
direct determination" of $X_1$ by $X_2$. Similarly, $\beta_{13.2}^2$ may be
interpreted as the percentage of the total variation in $X_1$ that
is due to its direct association with $X_3$. Figure 118 depicts the
coefficients of direct determination by arrows pointing from $X_2$
directly to $X_1$ and from $X_3$ directly to $X_1$.

Coefficients of Net Regression. The beta unsquared, $\beta_{12.3}$,
describes the change in $X_1$ in standard-deviation units that
accompanies a given change of $X_2$ in standard-deviation units,
when $X_3$ is constant. Geometrically, $\beta_{12.3}$ is the slope of the line
of intersection of a plane perpendicular to the $\frac{x_3}{\sigma_3}$ axis with the
plane of regression.

$$\frac{x_1'}{\sigma_1} = \beta_{12.3}\frac{x_2}{\sigma_2} + \beta_{13.2}\frac{x_3}{\sigma_3}$$

The statistic $\beta_{12.3}$ has been called the "coefficient of net regression"
of $X_1$ on $X_2$ in standard-deviation units. The coefficient
of net regression in original units is $b_{12.3}$.

Coefficient of Joint Determination. The term $2r_{23}\beta_{12.3}\beta_{13.2}$ may
be taken as representing the percentage of the total variation in
$X_1$ that is due to the joint or combined effect of $X_2$ and $X_3$
resulting from the correlation between these two variables. In
Fig. 118 the influence of $X_3$ on $X_1$ through its correlation with $X_2$
is depicted by the line $aa'$; the influence of $X_2$ on $X_1$ through its
correlation with $X_3$ is depicted by the line $bb'$. Relationships
along these lines indicate the significance of the $r$-$\beta$ cross-product
term of Eq. (31). While variation in $X_2$ may directly affect
$X_1$, it may also, through its correlation with $X_3$, bring about a
change in $X_3$ and hence cause further variation in $X_1$ resulting
from the connection between $X_3$ and $X_1$. Similarly, variation
in $X_3$ may affect $X_1$, not only directly, but also indirectly
through the association of $X_3$ and $X_2$. The term $2r_{23}\beta_{12.3}\beta_{13.2}$
may be taken to represent the combined indirect variation in
$X_1$ resulting from variations in $X_2$ and $X_3$.

Meaning of Residual Variance. The portion of variance in
$X_1$ due directly to $X_2$ is $\beta_{12.3}^2$; the portion due directly to $X_3$ is
$\beta_{13.2}^2$; the portion due to the joint influence of $X_2$ and $X_3$ is the
$2r$-$\beta$ cross-product term; the remainder of the variance is due
to other factors not linearly correlated with $X_2$ and $X_3$. This
is depicted in Fig. 118. The sum of all of these four terms is
equal to the total variance, or, expressed as a proportion, to 1.
The sum of the first three terms may be interpreted as the total
portion of variance in $X_1$ that is due to its association with
$X_2$ and $X_3$ jointly; the final term $\sigma_{1.23}^2/\sigma_1^2$ represents the portion
of the variance in $X_1$ that is due to its association with other
factors not linearly correlated with $X_2$ and $X_3$.

The Coefficient of Multiple Correlation. From the previous
discussion regarding the interpretation of the simple correlation
coefficient and the correlation ratio, it is to be expected that a
similar interpretation might be made of the multiple-correlation
coefficient, that is, that $R_{1.23}^2$ represents the portion of the total
variance in $X_1$ which is due to its joint association with $X_2$ and
$X_3$. Equation (31) shows that this is actually the case; for it
will be recalled that

$$R_{1.23}^2 = 1 - \frac{\sigma_{1.23}^2}{\sigma_1^2}$$

Therefore, it follows by Eqs. (29), (30), and (31) that

$$R_{1.23}^2 = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2} = \frac{\sigma_{x_1'}^2}{\sigma_1^2} \tag{32}$$

This interpretation of $R_{1.23}^2$ is similar to that previously made of

$$r_{12}^2 = \frac{\sigma_{x_1'}^2}{\sigma_1^2}$$

where $\sigma_{x_1'}$ is the standard deviation of the line of regression.

It has been noted that $R_{1.23}^2$ can be interpreted as the portion
of the total variance of $X_1$ that may be attributed to its joint
association with $X_2$ and $X_3$. It has also been shown that
$R_{1.23}^2 = \beta_{12.3}r_{12} + \beta_{13.2}r_{13}$ [see Eq. (20)]. Consequently, it is
possible to view $\beta_{12.3}r_{12}$ as the portion of the variance of $X_1$
that is due to its total association, both direct and indirect, with
$X_2$ and also to view $\beta_{13.2}r_{13}$ as the portion of the variance of $X_1$
that is due to its total association, both direct and indirect,
with $X_3$. Therefore, these two products have been called "coefficients
of total determination" of $X_1$ by $X_2$ and $X_3$.
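The four portions of Eq. (31), and the equality of the two expressions for $R_{1.23}^2$ in Eqs. (20) and (32), can be laid out numerically. The correlations below are hypothetical and the dictionary keys are labels invented for this sketch.

```python
def variance_shares(r12, r13, r23):
    """Split the variance of X1 (as a proportion) into the terms of Eq. (31)."""
    denom = 1.0 - r23 ** 2
    beta12_3 = (r12 - r13 * r23) / denom
    beta13_2 = (r13 - r12 * r23) / denom
    r_squared = beta12_3 * r12 + beta13_2 * r13      # Eq. (20)
    return {
        "direct_x2": beta12_3 ** 2,                  # coefficient of direct determination
        "direct_x3": beta13_2 ** 2,
        "joint": 2.0 * r23 * beta12_3 * beta13_2,    # 2r-beta cross-product term
        "residual": 1.0 - r_squared,                 # sigma_1.23^2 / sigma_1^2
    }

# Hypothetical correlations, for illustration only.
shares = variance_shares(0.6, 0.5, 0.4)
explained = shares["direct_x2"] + shares["direct_x3"] + shares["joint"]
total = explained + shares["residual"]
```

Here `explained` is $R_{1.23}^2$ in the form of Eq. (32), while `1 - shares["residual"]` is the same quantity in the form of Eq. (20); `total` should equal 1.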

When either of the products is negative, however,¹ it is preferable
to resolve the expression into its equal, namely,

$$\beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}$$

because

$$R_{1.23}^2 = \beta_{12.3}r_{12} + \beta_{13.2}r_{13} = \beta_{12.3}^2 + \beta_{13.2}^2 + 2r_{23}\beta_{12.3}\beta_{13.2}$$

Expressed in the $\beta^2$ and cross-product form, it is easier to understand
why a negative value of $\beta_{12.3}r_{12}$ or of $\beta_{13.2}r_{13}$ (but not of
both) may occur; for whenever either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$ is negative,
it will be found that the joint contribution of $X_2$ and $X_3$
to the variance of $X_1$, represented by $2r_{23}\beta_{12.3}\beta_{13.2}$, is also negative.
This follows because either $r_{23}$ is negative or $\beta_{12.3}$ and
$\beta_{13.2}$ are of opposite signs. In such a case, the direct effect of
$X_2$, for example, on the variation of $X_1$ would be opposite in
sign to its indirect effect, that is, through its correlation with
$X_3$; and the existence of this indirect link to $X_1$ through $X_3$
would tend to diminish the total variation caused in $X_1$ by
changes in $X_2$. This dampening effect of a negative value for
the $r$-$\beta$ cross-product term is what explains the existence of a
negative value for either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$. The form where they
are squared also shows the difficulty in trying to assign to the
variables $X_2$ and $X_3$ an independent part in accounting for the
variation in $X_1$. When there is a joint contribution by the two
variables, it becomes misleading to attempt to break it up and
assign part of it to one and part to the other, as the foregoing
interpretation based upon the form $\beta_{12.3}r_{12} + \beta_{13.2}r_{13}$ appears to
do.

¹ The interpretation of the products as coefficients of total determination
runs into difficulties in any particular case if either of them is negative.
Either $\beta_{12.3}r_{12}$ or $\beta_{13.2}r_{13}$ but not both may be negative, since their sum
equals $R_{1.23}^2$, which, of course, is not negative. To say that a variable
contributes a negative percentage to the total variance of $X_1$ has no meaning,
and consequently when either term is negative the interpretation is
imaginary.
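A numerical case may make the negative product easier to see. The correlations below are hypothetical, chosen so that $\beta_{13.2}r_{13}$ turns out negative; they form an admissible set (the determinant of the correlation matrix is positive), but they are not taken from the text.

```python
# Hypothetical correlations in which X3 is weakly related to X1 but
# strongly related to X2; invented for illustration only.
r12, r13, r23 = 0.8, 0.2, 0.6

denom = 1.0 - r23 ** 2
beta12_3 = (r12 - r13 * r23) / denom
beta13_2 = (r13 - r12 * r23) / denom

# "Coefficients of total determination" (products of beta and simple r).
total_x2 = beta12_3 * r12
total_x3 = beta13_2 * r13          # negative in this case

# The same R^2 resolved into squared betas and the cross-product term.
direct_x2 = beta12_3 ** 2
direct_x3 = beta13_2 ** 2
joint = 2.0 * r23 * beta12_3 * beta13_2   # also negative in this case

r_squared_products = total_x2 + total_x3
r_squared_resolved = direct_x2 + direct_x3 + joint
```

In this example $r_{23}$ is positive while the two $\beta$'s have opposite signs, so the cross-product term is negative, and the two ways of writing $R_{1.23}^2$ still agree.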

As noted above, the part of the variance, <rl, that is due to 
correlation between Xi and X2 and X3 together is described as 
follows:^ 

^2 __ 7?2 2 

■” -^1.23<^l 

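The right side of this expression describes the variance of the
plane of regression, just as $r_{12}^2\sigma_1^2$ describes the variance
of the line of regression. This part of the variance in the $X_1$
variable is made up of two parts that may be analyzed as follows:*

$$\begin{aligned}
\sigma_{1.23}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2) \\
&= \sigma_1^2(1 - r_{12}^2 - r_{13.2}^2 + r_{12}^2r_{13.2}^2) \\
&= \sigma_1^2 - r_{12}^2\sigma_1^2 - \sigma_1^2(1 - r_{12}^2)r_{13.2}^2 \\
&= \sigma_1^2 - r_{12}^2\sigma_1^2 - \sigma_{1.2}^2r_{13.2}^2
\end{aligned}$$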


¹ Cf. Eqs. (29) and (31).
* See footnote, p. 416.

Therefore, by transposing terms,

$$\sigma_1^2 = r_{12}^2\sigma_1^2 + \sigma_{1.2}^2r_{13.2}^2 + \sigma_{1.23}^2$$

It follows from Eqs. (33) and (29) that

$$\sigma_{1'}^2 = r_{12}^2\sigma_1^2 + r_{13.2}^2\sigma_{1.2}^2$$

in which $\sigma_{1.2}^2$ is the scatter about the line of regression of
$X_1$ on $X_2$.

It is thus seen that the variance $\sigma_{1'}^2$ of the plane of
regression consists of two parts: (1) the variance due to the total
linear association between $X_1$ and $X_2$, which is
$r_{12}^2\sigma_1^2$; and (2) of the remaining variance about the line of
regression ($\sigma_{1.2}^2$), the portion that is due to the influence
of $X_3$ not already included in $r_{12}$, namely,
$r_{13.2}^2\sigma_{1.2}^2$. The partial coefficient of correlation
$r_{13.2}$ describes the relationship between $X_1$ and $X_3$ when $X_2$
is held constant; in other words, it is the net correlation between $X_1$
and $X_3$. The square of this partial coefficient of correlation is
therefore the coefficient of proportional variance in $X_1$ due to net
correlation with $X_3$. If to variance due to total association with
$X_2$ is added the variance due to net correlation with $X_3$, the result
is the variance due to total correlation with $X_2$ and $X_3$ jointly.
The remainder of the variance, as Eq. (33) indicates, is due to other
causes not linearly correlated with $X_2$ and $X_3$.
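
The split of the variance into its three parts can be checked
numerically. The following is only a sketch and not part of the original
exposition: it borrows the zero-order correlations and the variance of
the first variable from the worked example of Chap. XVII, and the
function and variable names are illustrative assumptions.

```python
import math

# Zero-order statistics quoted from the worked example in Chap. XVII.
r12, r13, r23 = 0.89576, 0.62894, 0.59373
var1 = 1930.3155  # sigma_1 squared

def partial_r(r_ij, r_ik, r_jk):
    """First-order partial correlation r_ij.k from zero-order r's."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))

r13_2 = partial_r(r13, r12, r23)      # net correlation of X1 and X3
var1_2 = var1 * (1 - r12**2)          # scatter about the line of X1 on X2
var1_23 = var1_2 * (1 - r13_2**2)     # scatter about the plane

part_line = r12**2 * var1             # due to total association with X2
part_net = r13_2**2 * var1_2          # due to net correlation with X3
plane_variance = part_line + part_net

print(round(r13_2, 5), round(plane_variance, 2), round(var1_23, 2))
```

The two parts of the plane variance and the residual scatter should add
back to the total variance, apart from rounding in the inputs.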

Analysis of Variance and Causal Relationships. Where other
knowledge suggests that causal relationships run only in one
direction, the preceding analysis takes on considerable significance.
In biological investigations, for example, where the effect of heredity
is being studied, it seems logical to assume that variations in parents
cause variations in offspring and that the causal relationship does not
run the other way. Again, in certain economic problems, it is to be
assumed that fluctuations in weather conditions bring about changes in
economic conditions, but that the latter have no effect upon the former.
In one-directional setups of this kind, the $\beta^2$'s take on the full
significance of the connotation "coefficients of determination." The
$\beta^2$'s measure the amount of variation in the dependent variable
caused by fluctuations in each of the independent variables separately,
and in conjunction with the other variables in the $2r\beta$
cross-product expression. It is in this type of problem that correlation
analysis attains its greatest significance.

Where there is no reason to believe that causal relationships
are unilateral, the interpretation of the results of correlation
analysis in terms of causal determination becomes unscientific.
In most problems there is interaction between the variables
rather than a strictly one-directional association. It is still
possible to estimate what is called the dependent variable from
knowledge of so-called independent variables, but the latter
must not be looked upon as determining the former. With
reference to a regression equation used for purposes of estimation
and prediction, the $\beta^2$'s and the $2r\beta$ cross-product terms are
useful in showing how important certain factors are, separately
and in conjunction with other factors, in making estimates or
predictions of the dependent variable. They become coefficients
of determination only in the sense of statistical determination or
estimation and not in any sense of physical, biological, or economic
causation.

Of incidental importance to method and theory but of considerable
importance to practice, Eq. (31) provides a cross check on the
arithmetical work for finding most of the statistics in the
multiple-correlation problem.

EXTENSION OF MULTIPLE- AND PARTIAL-CORRELATION 
ANALYSIS TO FOUR VARIABLES 

The foregoing analysis, which has pertained to only three
variables, may be extended to cover a greater number of variables.
In this section its extension to a four-variable problem
will be discussed. In the next section its extension to any desired
number of variables will be considered.

When there are more than two independent variables in a
regression equation, the number of $\beta$ coefficients to be determined
is correspondingly increased. If there were four variables,
for example, that is, three independent variables, the $\beta$ form
of the regression equation would become

$$\frac{x_1}{\sigma_1} = \beta_{12.34}\frac{x_2}{\sigma_2} + \beta_{13.24}\frac{x_3}{\sigma_3} + \beta_{14.23}\frac{x_4}{\sigma_4} \tag{34}$$

If this regression plane is fitted by the method of least squares,
the $\beta$'s can be determined in terms of the $r$'s in the manner
described for the three-variable problem; but, there being three
equations to be solved for three $\beta$'s, the solutions are not so
simple as those given by Eqs. (8) and (9). Some special method of
solution may be employed, such as the so-called "Doolittle
method" of substitution or some determinant method.¹

The chief advantage claimed for the Doolittle method is that
it provides a check at each step in the problem; but experience
for several years with its use for four-variable correlation problems
has demonstrated its complexity and sensitivity. For a
multivariate problem of more than four variables both the
Doolittle work sheet and the determinant methods become
increasingly cumbersome.

A saving in computation is obtained by using formulas based
upon the algebraic evaluation of correlation statistics. This
saving was demonstrated for the trivariate case in the first part
of this chapter where it was found that the $\beta$'s could be evaluated
in terms of the zero-order $r$'s [see Eq. (9)]. In addition, however,
it is possible by symmetry to extend the algebraic results of
some of the formulas to apply to a multivariate correlation of
as many variables as desired, in effect to evaluate algebraically
the correlation statistics for the general regression equation.

$$X_i = a_{i.jk\ldots n} + b_{ij.k\ldots n}X_j + b_{ik.j\ldots n}X_k + \cdots + b_{in.jk\ldots}X_n \tag{35}$$


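First the correlation statistics will be algebraically evaluated
for the four-variable case, which is done as follows:

The normal least-squares equations for fitting a plane of
regression of $X_1$ on $X_2$, $X_3$, and $X_4$, when divided by $N$, are
as follows:²

$$\begin{aligned}
\frac{\Sigma x_1x_2}{N\sigma_1\sigma_2} &= \beta_{12.34}\frac{\Sigma x_2^2}{N\sigma_2^2} + \beta_{13.24}\frac{\Sigma x_2x_3}{N\sigma_2\sigma_3} + \beta_{14.23}\frac{\Sigma x_2x_4}{N\sigma_2\sigma_4} \\
\frac{\Sigma x_1x_3}{N\sigma_1\sigma_3} &= \beta_{12.34}\frac{\Sigma x_2x_3}{N\sigma_2\sigma_3} + \beta_{13.24}\frac{\Sigma x_3^2}{N\sigma_3^2} + \beta_{14.23}\frac{\Sigma x_3x_4}{N\sigma_3\sigma_4} \\
\frac{\Sigma x_1x_4}{N\sigma_1\sigma_4} &= \beta_{12.34}\frac{\Sigma x_2x_4}{N\sigma_2\sigma_4} + \beta_{13.24}\frac{\Sigma x_3x_4}{N\sigma_3\sigma_4} + \beta_{14.23}\frac{\Sigma x_4^2}{N\sigma_4^2}
\end{aligned}$$

These may be written, by substituting for their equivalent
values, $r_{12}$, $\sigma_2^2$, $r_{23}$, $r_{24}$, $r_{13}$,
$\sigma_3^2$, $r_{34}$, $r_{14}$, $\sigma_4^2$, as follows:

$$\left.\begin{aligned}
r_{12} &= \beta_{12.34} + r_{23}\beta_{13.24} + r_{24}\beta_{14.23} \\
r_{13} &= r_{23}\beta_{12.34} + \beta_{13.24} + r_{34}\beta_{14.23} \\
r_{14} &= r_{24}\beta_{12.34} + r_{34}\beta_{13.24} + \beta_{14.23}
\end{aligned}\right\} \tag{36}$$

¹ In theoretical work the determinant method gives neat and easily
remembered solutions. In practical work, however, many statisticians
prefer the method of substitution.

² See pp. 411-412.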

If the first of Eqs. (36) is multiplied by $r_{23}$ and the result is
subtracted from the second, $\beta_{12.34}$ is eliminated and it is found
that

$$r_{13} - r_{12}r_{23} = (1 - r_{23}^2)\beta_{13.24} + (r_{34} - r_{23}r_{24})\beta_{14.23}$$

or

$$\frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2} = \beta_{13.24} + \frac{r_{34} - r_{23}r_{24}}{1 - r_{23}^2}\,\beta_{14.23}$$

which, by Eqs. (9), is equivalent to saying

$$\beta_{13.2} = \beta_{13.24} + \beta_{43.2}\beta_{14.23} \tag{37}$$

Similarly, if the first of Eqs. (36), multiplied by $r_{24}$, is
subtracted from the third, $\beta_{12.34}$ is eliminated and it is found
that

$$\beta_{14.2} = \beta_{14.23} + \beta_{34.2}\beta_{13.24} \tag{38}$$


Correlation Statistics in Terms of Lower-order Correlation Statistics
of the Same Kind. From Eqs. (37) and (38), solved simultaneously, it is
found that

$$\beta_{14.23} = \frac{\beta_{14.2} - \beta_{34.2}\beta_{13.2}}{1 - \beta_{34.2}\beta_{43.2}} \tag{39}$$

All the other second-order $\beta$'s may be evaluated in a similar
algebraic manner, or written by symmetry; for example, by
symmetry with Eq. (39), the $\beta_{13.24}$ is as follows:

$$\beta_{13.24} = \frac{\beta_{13.2} - \beta_{43.2}\beta_{14.2}}{1 - \beta_{43.2}\beta_{34.2}} \tag{40}$$
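
As an illustration only, Eqs. (39) and (40) can be compared with a
direct solution of the normal equations (36). The six zero-order
correlations below are hypothetical values chosen for the sketch, not
data from the text, and the helper names `beta1` and `solve` are
assumptions introduced here.

```python
def beta1(r, i, j, k):
    """First-order beta_ij.k = (r_ij - r_ik r_jk) / (1 - r_jk^2), cf. Eqs. (9)."""
    return (r[i][j] - r[i][k] * r[j][k]) / (1 - r[j][k] ** 2)

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda q: abs(m[q][c]))
        m[c], m[p] = m[p], m[c]
        for q in range(c + 1, n):
            f = m[q][c] / m[c][c]
            for t in range(c, n + 1):
                m[q][t] -= f * m[c][t]
    x = [0.0] * n
    for c in reversed(range(n)):
        s = sum(m[c][t] * x[t] for t in range(c + 1, n))
        x[c] = (m[c][n] - s) / m[c][c]
    return x

# Hypothetical zero-order correlations (not from the text), indexed 1..4.
pairs = {(1, 2): 0.6, (1, 3): 0.5, (1, 4): 0.4,
         (2, 3): 0.3, (2, 4): 0.2, (3, 4): 0.1}
r = [[1.0] * 5 for _ in range(5)]
for (i, j), value in pairs.items():
    r[i][j] = r[j][i] = value

# Direct solution of the normal equations (36) for the three betas.
coef = [[r[j][k] for k in (2, 3, 4)] for j in (2, 3, 4)]
beta12_34, beta13_24, beta14_23 = solve(coef, [r[1][j] for j in (2, 3, 4)])

# Eqs. (39) and (40): second-order betas from first-order betas.
den = 1 - beta1(r, 3, 4, 2) * beta1(r, 4, 3, 2)
eq39 = (beta1(r, 1, 4, 2) - beta1(r, 3, 4, 2) * beta1(r, 1, 3, 2)) / den
eq40 = (beta1(r, 1, 3, 2) - beta1(r, 4, 3, 2) * beta1(r, 1, 4, 2)) / den

print(eq39, beta14_23)
print(eq40, beta13_24)
```

Each printed pair should agree to within floating-point error, which is
the sense in which the algebraic formulas replace the work sheet.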


Equations (39) and (40) are equivalent to the following expressions
in terms of the $b$'s, because, if
$\beta_{14.2} = b_{14.2}\dfrac{\sigma_4}{\sigma_1}$,
$\beta_{34.2} = b_{34.2}\dfrac{\sigma_4}{\sigma_3}$, and similar
values of $\beta$'s in terms of corresponding $b$'s and
standard-deviation ratios are substituted, the standard deviations
cancel out.¹

$$b_{14.23} = \frac{b_{14.2} - b_{34.2}b_{13.2}}{1 - b_{34.2}b_{43.2}} \tag{39'}$$

$$b_{13.24} = \frac{b_{13.2} - b_{43.2}b_{14.2}}{1 - b_{43.2}b_{34.2}} \tag{40'}$$

¹ Since the order of digits in the subscript after the point is immaterial
in both the $b$ and $\beta$ statistics, the following formulas may be used
as checks on the ones given in the text:

$$\beta_{14.32} = \frac{\beta_{14.3} - \beta_{24.3}\beta_{12.3}}{1 - \beta_{24.3}\beta_{42.3}} \tag{39}$$

$$b_{14.32} = \frac{b_{14.3} - b_{24.3}b_{12.3}}{1 - b_{24.3}b_{42.3}} \tag{39'}$$

Correlation Statistics in Terms of Lower-order r's and σ's. It is
possible to express the above formulas in terms of lower-order
$r$'s and $\sigma$'s if this method of calculation is preferred. It was
noted in Eq. (26) that

$$r_{12.3} = \beta_{12.3}\,\frac{\sigma_1\sigma_{2.3}}{\sigma_2\sigma_{1.3}}$$

and, by symmetry, it follows that

$$r_{14.2} = \beta_{14.2}\,\frac{\sigma_1\sigma_{4.2}}{\sigma_4\sigma_{1.2}}$$

Also, by Eq. (27),

$$r_{14.2} = b_{14.2}\,\frac{\sigma_{4.2}}{\sigma_{1.2}}$$

Consequently,

$$\beta_{14.2} = r_{14.2}\,\frac{\sigma_4\sigma_{1.2}}{\sigma_1\sigma_{4.2}} \qquad b_{14.2} = r_{14.2}\,\frac{\sigma_{1.2}}{\sigma_{4.2}}$$

If these values of the $\beta$'s in terms of $r$'s and standard
deviations are substituted in Eq. (39) and the values of the $b$'s in
terms of the $r$'s and the scatter are substituted in Eq. (39'), the
following results are obtained:

$$\beta_{12.34} = \frac{r_{12.3} - r_{42.3}r_{14.3}}{1 - r_{24.3}^2}\,\frac{\sigma_2\sigma_{1.3}}{\sigma_1\sigma_{2.3}} \tag{41}$$

$$b_{12.34} = \frac{r_{12.3} - r_{24.3}r_{14.3}}{1 - r_{24.3}^2}\,\frac{\sigma_{1.3}}{\sigma_{2.3}} \tag{41'}$$

Correlation Statistics in Terms of Correlation Statistics of Same
Order. The preceding formulas may be transformed to still
another form in which the correlation statistics are expressed in
terms of other correlation statistics of the same order. Thus,
since $\sigma_{1.34} = \sigma_{1.3}\sqrt{1 - r_{14.3}^2}$ and
$\sigma_{2.34} = \sigma_{2.3}\sqrt{1 - r_{24.3}^2}$, it follows that

$$\frac{\sigma_{1.3}}{\sigma_{2.3}} = \frac{\sigma_{1.34}\sqrt{1 - r_{24.3}^2}}{\sigma_{2.34}\sqrt{1 - r_{14.3}^2}}$$

If these values are substituted in Eqs. (41) and (41'), the following
results are obtained:

$$\beta_{12.34} = r_{12.34}\,\frac{\sigma_2\sigma_{1.34}}{\sigma_1\sigma_{2.34}} \tag{42}$$

$$b_{12.34} = r_{12.34}\,\frac{\sigma_{1.34}}{\sigma_{2.34}} \tag{42'}$$

Similar algebraic procedure will show that all the other $\beta$'s
and $b$'s have formulas that are symmetrical with the above.
For example,

$$\beta_{13.24} = r_{13.24}\,\frac{\sigma_3\sigma_{1.24}}{\sigma_1\sigma_{3.24}} \qquad b_{13.24} = r_{13.24}\,\frac{\sigma_{1.24}}{\sigma_{3.24}}$$

$$\beta_{14.23} = r_{14.23}\,\frac{\sigma_4\sigma_{1.23}}{\sigma_1\sigma_{4.23}} \qquad b_{14.23} = r_{14.23}\,\frac{\sigma_{1.23}}{\sigma_{4.23}}$$


Partial Correlation in the Four-variable Case. If there are
four variables $X_1$, $X_2$, $X_3$, and $X_4$, the partial-correlation
coefficient between $X_1$ and $X_2$ when $X_3$ and $X_4$ are held
constant, that is, $r_{12.34}$, is defined as the simple correlation
coefficient between the deviations of $X_1$ from the plane of regression
of $X_1$ on $X_3$ and $X_4$ and the deviations of $X_2$ from the plane of
regression of $X_2$ on $X_3$ and $X_4$. Accordingly, partial correlation,
when four variables are involved, is no more complex algebraically
than when three variables are involved; both are simple linear
correlations between residuals. In the three-variable problem
the partial coefficient of correlation is found by correlating the
residuals about lines of regression; in the four-variable problem
the partial coefficient of correlation is found by correlating the
residuals about planes of regression. The formula for a second-order
partial coefficient of correlation such as $r_{12.34}$ is obtained
in the same algebraic manner as for the first-order partial
coefficients of correlation [see Eqs. (22) and (23)]. Written for
$r_{13.24}$, the formula is

$$r_{13.24} = \frac{r_{13.2} - r_{14.2}r_{34.2}}{\sqrt{1 - r_{14.2}^2}\,\sqrt{1 - r_{34.2}^2}} \tag{43}$$


This is a partial coefficient of correlation of the second order.
It will be noted that the formula runs in terms of the correlation
coefficients of the first order. To make use of this formula in
practice, therefore, it first becomes necessary to calculate the
zero-order correlation coefficients and then the first-order
coefficients before the second-order coefficients can be determined.
The calculation of higher-order partial-correlation coefficients is
thus similar to the calculation of those of lower order, but they
require additional work.

Third-order Variance and Multiple-correlation Coefficient.
The formula for third-order variance in the four-variable problem
is obtained by adding a term to the formula for scatter in
the three-variable problem, as follows:

$$\sigma_{1.234}^2 = \sigma_1^2(1 - \beta_{12.34}r_{12} - \beta_{13.24}r_{13} - \beta_{14.23}r_{14}) \tag{44}$$

The equation for the multiple-correlation coefficient is, by
symmetry, as follows:

$$R_{1.234}^2 = 1 - \frac{\sigma_{1.234}^2}{\sigma_1^2} \tag{45}$$

or

$$R_{1.234}^2 = \beta_{12.34}r_{12} + \beta_{13.24}r_{13} + \beta_{14.23}r_{14}$$


EXTENSION OF MULTIPLE- AND PARTIAL-CORRELATION 
ANALYSIS TO ANY DESIRED NUMBER OF VARIABLES 

For a three-variable and even for a four-variable correlation
problem, it is probably desirable first to calculate the $\beta$'s from
the lower-order $\beta$'s. In the three-variable correlation problem,
this means calculating the $\beta$'s from the zero-order $r$'s, which at
that level correspond to the $\beta$'s.

For more than four variables it may be better first to calculate
the higher-order $r$'s from the lower-order $r$'s and thereafter
obtain the $b$ and $\beta$ statistics. Following is the extension to the
general multivariate problem of the various formulas, showing the
series of formulas in some cases from the simple correlation to
the multivariate in order to depict how they are readily obtained
by symmetry.

Extension of Formulas to General Multivariate Problem.

Statistics for the Regression Planes. The $b$ and $\beta$ statistics are
evaluated, in general, by the following formula:

(a) $$b_{ij} = r_{ij}\,\frac{\sigma_i}{\sigma_j}$$

or

$$b_{ij.kt\ldots n} = r_{ij.kt\ldots n}\,\frac{\sigma_{i.kt\ldots n}}{\sigma_{j.kt\ldots n}}$$

This formula can be used when the $r$'s and $\sigma$'s of the same order
have already been obtained, but it does not provide for the
calculation of the $b$'s from lower-order statistics.

In the simple linear-regression equation, carrying out the
instructions of the above formula gives
$b_{12} = r_{12}\dfrac{\sigma_1}{\sigma_2}$. If there are three
variables, it becomes
$b_{12.3} = r_{12.3}\dfrac{\sigma_{1.3}}{\sigma_{2.3}}$, etc.

The relationship between the $b$'s and the $\beta$'s is not affected
by the number of variables and may be summarized as follows:

(b) $$\begin{aligned}
\beta_{12.3} &= b_{12.3}\,\frac{\sigma_2}{\sigma_1} \\
\beta_{12.34} &= b_{12.34}\,\frac{\sigma_2}{\sigma_1} \\
\beta_{13.2} &= b_{13.2}\,\frac{\sigma_3}{\sigma_1} \\
\beta_{13.24} &= b_{13.24}\,\frac{\sigma_3}{\sigma_1} \\
\beta_{ij.kt\ldots n} &= b_{ij.kt\ldots n}\,\frac{\sigma_j}{\sigma_i}
\end{aligned}$$

The $a$'s for the various planes of regression are found by formulas
such as the following:

(c) $$\begin{aligned}
a_{1.234} &= \bar{X}_1 - b_{12.34}\bar{X}_2 - b_{13.24}\bar{X}_3 - b_{14.23}\bar{X}_4 \\
a_{2.134} &= \bar{X}_2 - b_{21.34}\bar{X}_1 - b_{23.14}\bar{X}_3 - b_{24.13}\bar{X}_4 \\
a_{i.jk\ldots n} &= \bar{X}_i - b_{ij.k\ldots n}\bar{X}_j - b_{ik.j\ldots n}\bar{X}_k - \cdots - b_{in.kj\ldots m}\bar{X}_n
\end{aligned}$$

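High-order Variances. The higher-order variances may be
estimated by using either the $\beta$ formulas or the partial $r$
formulas:

(d) $$\begin{aligned}
\sigma_{1.234}^2 &= \sigma_1^2(1 - \beta_{12.34}r_{12} - \beta_{13.24}r_{13} - \beta_{14.23}r_{14}) \\
\sigma_{i.jk\ldots n}^2 &= \sigma_i^2(1 - \beta_{ij.k\ldots n}r_{ij} - \beta_{ik.j\ldots n}r_{ik} - \cdots - \beta_{in.k\ldots}r_{in}) \\
\sigma_{1.2}^2 &= \sigma_1^2(1 - r_{12}^2) \\
\sigma_{1.23}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2) \\
\sigma_{1.234}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.32}^2) \\
\sigma_{i.jk\ldots n}^2 &= \sigma_i^2(1 - r_{ij}^2)(1 - r_{ik.j}^2) \cdots (1 - r_{in.jk\ldots}^2)
\end{aligned}$$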
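
The agreement of the two routes in (d) can be illustrated with a short
sketch. The zero-order correlations are hypothetical values, not data
from the text, the variance of the first variable is set to 1 for
convenience, and the function names are assumptions; `beta2` is
patterned on Eq. (39).

```python
import math

# Hypothetical zero-order correlations (not from the text), indexed 1..4.
pairs = {(1, 2): 0.6, (1, 3): 0.5, (1, 4): 0.4,
         (2, 3): 0.3, (2, 4): 0.2, (3, 4): 0.1}
r = [[1.0] * 5 for _ in range(5)]
for (i, j), value in pairs.items():
    r[i][j] = r[j][i] = value

def beta1(i, j, k):
    """First-order beta_ij.k from zero-order r's."""
    return (r[i][j] - r[i][k] * r[j][k]) / (1 - r[j][k] ** 2)

def beta2(i, j, k, l):
    """Second-order beta_ij.kl from first-order betas, patterned on Eq. (39)."""
    num = beta1(i, j, k) - beta1(l, j, k) * beta1(i, l, k)
    return num / (1 - beta1(l, j, k) * beta1(j, l, k))

def partial1(i, j, k):
    """First-order partial r_ij.k."""
    return (r[i][j] - r[i][k] * r[j][k]) / math.sqrt(
        (1 - r[i][k] ** 2) * (1 - r[j][k] ** 2))

def partial2(i, j, k, l):
    """Second-order partial r_ij.kl from first-order partials."""
    a, b, c = partial1(i, j, k), partial1(i, l, k), partial1(j, l, k)
    return (a - b * c) / math.sqrt((1 - b * b) * (1 - c * c))

var1 = 1.0  # hypothetical sigma_1 squared; both results scale with it

via_beta = var1 * (1 - beta2(1, 2, 3, 4) * r[1][2]
                     - beta2(1, 3, 2, 4) * r[1][3]
                     - beta2(1, 4, 2, 3) * r[1][4])
via_partials = (var1 * (1 - r[1][2] ** 2)
                * (1 - partial1(1, 3, 2) ** 2)
                * (1 - partial2(1, 4, 3, 2) ** 2))

print(via_beta, via_partials)
```

The two printed values should coincide to within floating-point error,
whichever route is used for the arithmetic.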

The Multiple-correlation Formulas. As in the case of the
relationship between the $b$'s and the $\beta$'s, the formula for the
multiple-correlation coefficient is of the same form regardless
of the number of variables; it is always a comparison of the
residual scatter about a plane of regression with the total standard
deviation. In the four-variable and the general multiple-variable
cases it is as follows:

(e) $$R_{1.234}^2 = 1 - \frac{\sigma_{1.234}^2}{\sigma_1^2}$$

$$R_{i.jkt\ldots n}^2 = 1 - \frac{\sigma_{i.jkt\ldots n}^2}{\sigma_i^2}$$

The Partial-correlation Formulas. The forms of the formulas for
the partial coefficients of correlation are also independent of the
number of variables in the problem, because the partial-correlation
coefficient always pertains to a simple linear correlation.
The formulas are therefore symmetrical, as follows:

Second-order partials:

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{\sqrt{1 - r_{14.3}^2}\,\sqrt{1 - r_{24.3}^2}}$$

And in general, the formula for calculating partials of a given
order from the next lower order partials:

$$r_{ij.ko\ldots n} = \frac{r_{ij.ko\ldots(n-1)} - r_{in.ko\ldots(n-1)}\,r_{jn.ko\ldots(n-1)}}{\sqrt{1 - r_{in.ko\ldots(n-1)}^2}\,\sqrt{1 - r_{jn.ko\ldots(n-1)}^2}}$$
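
The general formula lends itself to a recursive calculation. The sketch
below is one possible arrangement, not the authors' work sheet: the
function name `partial_r` and the use of a dictionary keyed by pairs of
subscripts are assumptions, while the three zero-order coefficients are
those found in Chap. XVII.

```python
import math

def partial_r(r, i, j, held=()):
    """Partial correlation r_ij.held, built from the next lower order.

    r maps frozenset({a, b}) to the zero-order correlation r_ab;
    held is the tuple of variables held constant.
    """
    if not held:
        return r[frozenset((i, j))]
    *rest, n = held
    rest = tuple(rest)
    r_ij = partial_r(r, i, j, rest)
    r_in = partial_r(r, i, n, rest)
    r_jn = partial_r(r, j, n, rest)
    return (r_ij - r_in * r_jn) / math.sqrt((1 - r_in**2) * (1 - r_jn**2))

# Zero-order coefficients from the illustration in Chap. XVII.
r = {frozenset((1, 2)): 0.89576,
     frozenset((1, 3)): 0.62894,
     frozenset((2, 3)): 0.59373}

print(round(partial_r(r, 1, 2, (3,)), 5))
print(round(partial_r(r, 1, 3, (2,)), 5))
print(round(partial_r(r, 2, 3, (1,)), 5))
```

With a fourth variable added to the dictionary, a call such as
`partial_r(r, 1, 2, (3, 4))` would follow the second-order formula
without further change.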






CHAPTER XVII 


ANALYSIS OF A MULTIVARIATE FREQUENCY 
DISTRIBUTION ILLUSTRATED 

To illustrate the analysis of a multivariate frequency distribution,
data on grades of 81 freshmen in a woman's college
were selected. These are arranged in Table 47 in such a way
as to facilitate the construction of simple linear-correlation tables
and to facilitate detailed study of the multivariate distribution.
The first part of this chapter will illustrate
trivariate-frequency-distribution analysis. The $X_4$ variable is
included in the table so that later in the chapter the analysis can be
extended to four variables. Beyond four variables, the method proceeds
in a symmetrical fashion.

Examination of Multivariate Distribution. The first step in
the analysis of a multivariate distribution is to determine how
well the assumption of linearity of relationship is approximated.
The scatter of cases in the correlation table for $X_1$ and $X_2$
appears to indicate simple linear regression between $X_1$ and $X_2$.*
In Tables 48 and 49, correlation tables for $X_1$ and $X_3$ and for
$X_2$ and $X_3$, respectively, the scatter of cases suggests that these
regressions might be only slightly if at all nonlinear. As pointed
out in the preceding chapter, however, the simple correlation
charts cannot be expected to reveal how much net regression
exists in a trivariate correlation problem; accordingly, further
study of the trivariate distribution of $X_1$, $X_2$, and $X_3$ should be
undertaken. In order to do this the multivariate distribution
itself, shown in Table 47, must be studied.

For example, the net regression of $X_2$ on $X_3$ may be tested
in a preliminary fashion by holding constant the $X_1$ variable.
Accordingly, analysis may be made of the $X_2$ and $X_3$ grades of
the 16 students whose $X_1$ grades are 200. Table 50 shows the
$X_2$ and $X_3$ grades of these 16 freshmen, selected from Table 47.


* See Table 34 (p. 358).

Table 47. Grades of 81 Mount Holyoke Freshmen in Verbal
Scholastic-aptitude Test, in College Board English Examination, and in
First- and Second-semester English
(A, B, C, and D grades converted to a numerical scale)

Student   Second-semester   First-semester   Verbal    College Board
number    English grade     English grade    S.A.T.    English examination
          X1                X2               X3        X4

   1          240               220           573          407
   2          200               180           473          456
   3          260               240           680          615
   4          260               260           568          671
   5          160               160           437          546
   6          240               220           567          615
   7          220               200           671          449
   8           60               120           415          428
   9          220               240           434          726
  10          200               180           437          504
  11          220               220           443          601
  12          140               180           583          449
  13          160               120           384          553
  14          240               260           634          546
  15          260               240           477          671
  16          200               160           594          553
  17          200               160           495          449
  18          240               240           592          532
  19          240               220           598          574
  20          240               220           536          650
  21          160               140           408          532
  22          220               240           569          518
  23          200               200           443          477
  24          100               100           421          518
  25          160               140           356          366
  26          200               160           535          449
  27          180               180           415          671
  28          180               160           502          539
  29          240               240           477          532
  30          200               200           504          546
  31          200               200           431          518
  32          180               160           442          518
  33          260               220           509          518
  34          160               120           449          560
  35          240               240           416          539
  36          220               220           531          587
  37          220               240           518          622
  38          160               120           472          532
  39          200               200           619          643
  40          220               220           456          463

Table 47. Grades of 81 Mount Holyoke Freshmen in Verbal
Scholastic-aptitude Test, in College Board English Examination, and in
First- and Second-semester English. (Continued)
first-semester English grade. X3 = verbal scholastic-aptitude test.

Table 50. X2 and X3 Grades of 16 Students Whose X1 Grade Is 200

Student
number      X2      X3

   2        180     473
  10        180     437
  16        160     594
  17        160     495
  23        200     443
  26        160     535
  30        200     504
  31        200     431
  39        200     619
  44        220     424
  45        200     515
  49        200     479
  56        220     489
  70        200     391
  73        180     477
  81        180     479


Figure 119 shows the scatter of these 16 bivariates; it indicates
that the regression may be linear although there appears to be
little tendency for $X_2$ and $X_3$, unaffected by $X_1$, to be
correlated with each other. Evidently violence is not done to the facts
by assuming that any correlation present is linear in character.
It must be remembered, of course, that this preliminary test of
net regression between $X_2$ and $X_3$, holding $X_1$ constant, is based
on a small sample of only 16 observations. Similar tests, holding
$X_1$ constant at several other values, respectively, should be
made, especially if nonlinearity is suspected.

Fig. 119. Net regression of $X_2$ on $X_3$, holding $X_1$ constant. Test
based on $X_1$ = 200. (Scatter diagram with $X_2$ on one axis and $X_3$
on the other.)
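
As a numerical companion to the scatter diagram, the simple correlation
between $X_2$ and $X_3$ for the 16 cases of Table 50 can be computed
directly. This is a sketch only; the function name `pearson_r` is an
assumption, and a coefficient based on so few cases is no more than a
rough indication, as the text cautions.

```python
import math

# (X2, X3) pairs from Table 50: the 16 students whose X1 grade is 200.
pairs = [(180, 473), (180, 437), (160, 594), (160, 495),
         (200, 443), (160, 535), (200, 504), (200, 431),
         (200, 619), (220, 424), (200, 515), (200, 479),
         (220, 489), (200, 391), (180, 477), (180, 479)]

def pearson_r(data):
    """Simple (zero-order) coefficient of correlation for paired values."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    syy = sum((y - mean_y) ** 2 for _, y in data)
    return sxy / math.sqrt(sxx * syy)

r_subset = pearson_r(pairs)
print(len(pairs), round(r_subset, 3))
```

The same function could be applied to the cases selected at other fixed
values of $X_1$, as the text recommends.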

Inasmuch as Table 34 (page 358) shows such a clear linear
total regression between $X_1$ and $X_2$, it may be assumed that the
net regression of $X_1$ on $X_2$, and vice versa, is linear. For further
illustration of the method of examining the multivariate frequency
distribution to test for linearity of regression, Table 51
presents the joint variation in $X_1$ and $X_3$ grades of those students
whose $X_2$ grade is 220. This will make it possible to test the
linearity of net regression between $X_1$ and $X_3$, holding $X_2$
constant.

Table 51. X1 and X3 Grades of 18 Students Whose X2 Grade Is 220

Student
number      X1      X3

   1        240     573
   6        240     567
  11        220     443
  19        240     598
  20        240     536
  33        260     509
  36        220     531
  40        220     456
  44        200     424
  48        280     581
  50        220     579
  52        240     610
  54        220     458
  56        200     489
  57        220     567
  62        240     596
  69        240     703
  79        220     432


The bivariate frequency distribution of the eighteen cases in
Table 51 is plotted in Fig. 120, which shows that their scatter,
with the exception of one case, follows a linear path. It may be
concluded that the net regression between $X_1$ and $X_3$ approximates
the linear form. Here again, especially if nonlinearity is
suspected, similar tests using several different values of $X_2$,
respectively, should be made. Tables 50 and 51 serve only to
illustrate the method, which would ordinarily need to be applied
more completely than is done here.

Fig. 120. Net regression between $X_1$ and $X_3$, holding $X_2$
constant. Test based on $X_2$ = 220. (Scatter diagram with $X_1$ on one
axis and $X_3$ on the other.)

Statistics of the Trivariate Frequency Distribution. Study of
the trivariate frequency distribution $X_1$, $X_2$, and $X_3$, shown in
Table 47, appears to indicate that regressions are linear, and
therefore it may be assumed that the methods of correlation
outlined in Chap. XVI may appropriately be applied. In order
to calculate the statistics for a trivariate frequency distribution,
it is necessary to obtain first the statistics for the various
monovariate distributions.

Calculation of Zero-order Statistics. Correlation tables 48
and 49 may be used as work sheets, as illustrated in Chap. XIV,
to calculate the zero-order statistics. Since the problem is now
one of analyzing a trivariate frequency distribution, it is well
to set up a schedule for calculation of the respective statistics.
Table 52 is a schedule of the means and variances for the three
monovariate frequency distributions taken separately. In each
case the mean was calculated by using the formula

Table 52. Means and Variances

Means                  Sums of squares                 Variances                      Standard deviations

$\bar{X}_1 = 217.4$    $N\sigma_1^2 = 156,355.56$      $\sigma_1^2 = 1,930.3155$      $\sigma_1 = 43.94$
$\bar{X}_2 = 204.1$    $N\sigma_2^2 = 181,155.56$      $\sigma_2^2 = 2,236.4883$      $\sigma_2 = 47.29$
$\bar{X}_3 = 515.5$    $N\sigma_3^2 = 719,222.22$      $\sigma_3^2 = 8,879.2866$      $\sigma_3 = 94.23$


The sum of squares of deviations from the mean in each case is
calculated by using the formula

$$N\sigma^2 = i^2\left[\Sigma fd^2 - \frac{(\Sigma fd)^2}{N}\right]$$

The required sums are all found in the correlation tables, for
example,

$$N\sigma_1^2 = 400\left[543 - \frac{(111)^2}{81}\right] = 400(543 - 152.11111) = 156,355.56$$
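
The arithmetic of this example can be retraced in a few lines. The
sketch assumes a class interval of 20 (so that its square is the 400 of
the text) and a sum of class deviations of 111, a figure inferred here
from the printed 152.11111 (111 squared divided by 81); the variable
names are illustrative only.

```python
import math

N = 81            # number of students
interval = 20     # assumed class interval; 20**2 = 400 as in the text
sum_fd2 = 543     # sum of squared class deviations, from the text
sum_fd = 111      # inferred: 111**2 / 81 = 152.11111

sum_sq = interval**2 * (sum_fd2 - sum_fd**2 / N)   # N * sigma_1^2
variance = sum_sq / N                              # sigma_1^2
std_dev = math.sqrt(variance)                      # sigma_1

print(round(sum_sq, 2), round(variance, 4), round(std_dev, 2))
# 156355.56 1930.3155 43.94
```

These are the entries shown for the first variable in the schedule of
means and variances.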


Table 53 is a schedule for the calculation of zero-order
Pearsonian coefficients of correlation, using the equation that was
used in Chap. XIV. This equation is shown at the head of Table
53. The entries all come from Tables 48 and 49 and Table 34
(page 358).


Table 53. Calculation of Simple r's

           (1)        (2)          (3)          (4)           (5)
                                                (1) - (2)     (4) ÷ (3)

r12      455.00     78.11111    420.74964    376.88889     +0.89576
r13      283.00    -68.51852    558.90486    351.51852     +0.62894
r23      322.00    -35.18519    601.59915    357.18519     +0.59373

Calculation of First-order Statistics. As suggested in Chap.
XVI, the first-order statistics of a trivariate frequency distribution
may be calculated by several methods. The most efficient
method appears to be to calculate the first-order $\beta$'s and from
the first-order $\beta$'s to calculate the other first-order statistics.
Table 54 is a work sheet for the orderly calculation of the first-order
$\beta$'s in the illustrated trivariate frequency distribution.

Table 54. Calculation of the First-order β's from the Zero-order r's¹

$$\beta_{ij.k} = \frac{r_{ij} - r_{ik}r_{jk}}{1 - r_{jk}^2}$$

[See Chap. XVI, Eqs. (9)]

        (1)                 (2)           (3)          (4)             (5)
  Zero-order r (β)        Product        Whole                     First-order β
Subscript  Regression     term of        numerator    1 - r²    Subscript  Regression
           statistic      numerator                                        statistic

   12      0.89576        0.37342        0.52234      0.64748     12.3     0.80673
   13      0.62894
   23      0.59373

   13      0.62894        0.53184        0.09710      0.64748     13.2     0.14997
   12      0.89576
   23      0.59373

   12      0.89576        0.37342        0.52234      0.60444     21.3     0.86417
   23      0.59373
   13      0.62894

   23      0.59373        0.56338        0.03035      0.60444     23.1     0.05021
   12      0.89576
   13      0.62894

   13      0.62894        0.53184        0.09710      0.19762     31.2     0.49135
   23      0.59373
   12      0.89576

   23      0.59373        0.56338        0.03035      0.19762     32.1     0.15358
   13      0.62894
   12      0.89576

¹ Note the internal checks in columns (2), (3), and (4), in which each of
three values occurs twice; in column (2), the first and third, second and
fifth, fourth and sixth figures check; in column (3), the same orders
check; in column (4), the first and second, the third and fourth, and the
fifth and sixth figures check. While not independent checks, they
nevertheless give confidence in the accuracy of the work as it proceeds.

If preferred, the $b$'s instead of the $\beta$'s could first be
calculated, by using a similar table and the general formula

$$b_{ij.k} = \frac{b_{ij} - b_{ik}b_{kj}}{1 - r_{jk}^2}$$

The entries in column (1) of the table are obtained from Table 53.
Bearing in mind the symmetry in the formula shown at the head
of Table 54, the zero-order $r$'s, which are also the zero-order
$\beta$'s, are copied in the order in which they occur in the formula.
Consequently, the entries in column (2) are the products of the
$r$'s in the second and third lines of each trio of $r$'s in column (1).
The entry in column (3) is the first $r$ of each trio minus the
entry in column (2). The $1 - r^2$ in column (4), which may be
found by using a sine table, is for the third $r$ in each trio of $r$'s
in column (1). Thus, if the trios of $r$'s are properly arranged
in column (1), which can be done by following the general formula
at the head of the table, the symmetry of the work sheet facilitates
all necessary calculations. In using this work sheet, the
first step is to write in column (5) the subscript for the first-order
$\beta$ that is to be calculated; this subscript then determines
the order of the zero-order $r$'s in column (1). The value of the
first-order $\beta$, entered in column (5), is found by dividing the
entry in column (3) by the corresponding entry in column (4).
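
The columns of the work sheet translate directly into a short routine.
The following sketch is offered as an illustration of the procedure just
described, not as the authors' own layout; the names `zero` and `beta`
are assumptions, and the inputs are the zero-order coefficients of
Table 53.

```python
# Zero-order coefficients from Table 53.
r = {(1, 2): 0.89576, (1, 3): 0.62894, (2, 3): 0.59373}

def zero(i, j):
    """Zero-order r_ij; the order of the two subscripts is immaterial."""
    return r[(min(i, j), max(i, j))]

def beta(i, j, k):
    """First-order beta_ij.k following the columns of Table 54."""
    product = zero(i, k) * zero(j, k)      # column (2)
    numerator = zero(i, j) - product       # column (3)
    one_minus_r2 = 1 - zero(j, k) ** 2     # column (4)
    return numerator / one_minus_r2        # column (5)

subscripts = [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
betas = {f"{i}{j}.{k}": beta(i, j, k) for i, j, k in subscripts}
for name, value in betas.items():
    print(name, round(value, 5))
```

Small differences in the fifth decimal place from the printed table are
to be expected, since the work sheet rounds each column before the
division.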

The coefficients of partial correlation are readily calculated
from the $\beta$'s, as follows:¹

$$r_{12.3}^2 = \beta_{12.3}\beta_{21.3} = 0.80673(0.86417) = 0.69715 \qquad r_{12.3} = 0.83496$$

$$r_{13.2}^2 = \beta_{13.2}\beta_{31.2} = 0.14997(0.49135) = 0.07369 \qquad r_{13.2} = 0.27146$$

$$r_{23.1}^2 = \beta_{23.1}\beta_{32.1} = 0.05021(0.15358) = 0.007711 \qquad r_{23.1} = 0.08782$$

¹ The coefficients of partial correlation could be checked by using any
one of several formulas, as follows:

$$r_{ij.k} = \frac{r_{ij} - r_{ik}r_{jk}}{\sqrt{1 - r_{ik}^2}\,\sqrt{1 - r_{jk}^2}}$$

$$r_{ij.k} = b_{ij.k}\,\frac{\sigma_{j.k}}{\sigma_{i.k}}$$

$$r_{ij.k} = \beta_{ij.k}\,\frac{\sqrt{1 - r_{jk}^2}}{\sqrt{1 - r_{ik}^2}}$$

These formulas all have the advantage that they determine the positive
or negative sign of the partial $r$; but the partial $r$ always has the
same sign as its corresponding $\beta$. Cf. also p. 460.

Thus is determined the arithmetical value of the first-order
coefficients of partial correlation. Each coefficient of partial
correlation is positive if the $\beta$'s from which it is derived are
both positive, negative if the $\beta$'s are negative. The respective
pairs of $\beta$'s involved are never of opposite sign.
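
The rule for magnitude and sign can be put in a few lines. This is a
sketch with an assumed function name, `partial_from_betas`; it uses the
first-order $\beta$'s as printed in the text.

```python
import math

def partial_from_betas(beta_ij, beta_ji):
    """r_ij.k as the square root of beta_ij.k * beta_ji.k, signed like the betas."""
    product = beta_ij * beta_ji
    if product < 0:
        raise ValueError("paired betas should never be of opposite sign")
    return math.copysign(math.sqrt(product), beta_ij)

r12_3 = partial_from_betas(0.80673, 0.86417)
r13_2 = partial_from_betas(0.14997, 0.49135)
r23_1 = partial_from_betas(0.05021, 0.15358)
print(round(r12_3, 5), round(r13_2, 5), round(r23_1, 5))
```

A pair of negative $\beta$'s would return a negative coefficient, in
keeping with the rule of signs stated above.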

The $b$ statistics are calculated from the $\beta$'s as follows:

$$b_{ij.k} = \beta_{ij.k}\,\frac{\sigma_i}{\sigma_j}$$

$$\begin{aligned}
b_{12.3} &= \beta_{12.3}\,\frac{\sigma_1}{\sigma_2} = 0.80673\,\frac{43.9354}{47.2915} = 0.80673(0.92903) = 0.74948 \\
b_{13.2} &= \beta_{13.2}\,\frac{\sigma_1}{\sigma_3} = 0.14997\,\frac{43.9354}{94.2300} = 0.14997(0.46626) = 0.06992 \\
b_{21.3} &= \beta_{21.3}\,\frac{\sigma_2}{\sigma_1} = 0.86417\,\frac{47.2915}{43.9354} = 0.86417(1.07639) = 0.93020 \\
b_{23.1} &= \beta_{23.1}\,\frac{\sigma_2}{\sigma_3} = 0.05021\,\frac{47.2915}{94.2300} = 0.05021(0.50187) = 0.02520 \\
b_{31.2} &= \beta_{31.2}\,\frac{\sigma_3}{\sigma_1} = 0.49135\,\frac{94.2300}{43.9354} = 0.49135(2.1447) = 1.05380 \\
b_{32.1} &= \beta_{32.1}\,\frac{\sigma_3}{\sigma_2} = 0.15358\,\frac{94.2300}{47.2915} = 0.15358(1.9925) = 0.30601
\end{aligned}$$

The first-order $a$ statistics are calculated as follows:

$$a_{i.jk} = \bar{X}_i - b_{ij.k}\bar{X}_j - b_{ik.j}\bar{X}_k$$

$$\begin{aligned}
a_{1.23} &= 217.4074 - 0.74948(204.074) - 0.06992(515.4816) = 28.4156 \\
a_{2.13} &= 204.074 - 0.93020(217.4074) - 0.02520(515.4816) = -11.148 \\
a_{3.12} &= 515.4816 - 1.05380(217.4074) - 0.30601(204.074) = 223.929
\end{aligned}$$

The equations of the three planes of regression are, therefore,
as follows:

$$\begin{aligned}
X_1' &= 28.42 + 0.75X_2 + 0.07X_3 \\
X_2' &= -11.15 + 0.93X_1 + 0.025X_3 \\
X_3' &= 223.93 + 1.05X_1 + 0.31X_2
\end{aligned}$$

If $X_1$ is considered the dependent variable, it can be estimated
from the first equation; if $X_2$ is considered the dependent variable,
it can be estimated from the second equation; if $X_3$ is
considered the dependent variable, it can be estimated from the
third equation. The second-order standard deviations, respectively,
about the three planes of regression may also be calculated.¹

$$\sigma_{i.jk}^2 = \sigma_i^2(1 - r_{ij}^2)(1 - r_{ik.j}^2)$$

$$\begin{aligned}
\sigma_{1.23}^2 &= \sigma_1^2(1 - r_{12}^2)(1 - r_{13.2}^2) = 1,930.3155(0.19762)(0.92631) = 353.3585 & \sigma_{1.23} &= 18.7975 \\
\sigma_{2.13}^2 &= \sigma_2^2(1 - r_{12}^2)(1 - r_{23.1}^2) = 2,236.4883(0.19762)(0.99229) = 438.5672 & \sigma_{2.13} &= 20.9416 \\
\sigma_{3.12}^2 &= \sigma_3^2(1 - r_{13}^2)(1 - r_{23.1}^2) = 8,879.2866(0.60444)(0.99229) = 5,325.616 & \sigma_{3.12} &= 72.9767
\end{aligned}$$

¹ They could also be calculated by using Eq. (19) (p. 415). Thus the
calculation of $\sigma^2_{1.23}$ could be checked not only by using

$$\sigma^2_{1.32} = \sigma^2_1(1 - r^2_{13})(1 - r^2_{12.3})$$

but also by using the following formula:

$$\sigma^2_{1.23} = \sigma^2_1(1 - \beta_{12.3}r_{12} - \beta_{13.2}r_{13})$$

The multiple-correlation coefficients, which also measure the
goodness of fit of the planes of regression, may now be calculated
as follows:²

$$R^2_{i.jk} = 1 - \frac{\sigma^2_{i.jk}}{\sigma^2_i}$$

$$R^2_{1.23} = 1 - \frac{353.3585}{1{,}930.3155} \qquad R_{1.23} = 0.9039$$

$$R^2_{2.13} = 1 - \frac{438.5672}{2{,}236.4883} \qquad R_{2.13} = 0.8966$$

$$R^2_{3.12} = 1 - \frac{5{,}325.616}{8{,}879.2866} = 1 - 0.59978 = 0.4002 \qquad R_{3.12} = 0.6326$$

² The calculation of $R$ may be checked by using Eq. (20), p. 417.
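As a hedged illustration of the relation between scatter and multiple correlation, the sketch below recomputes $R$ from the total and scatter variances quoted in the text. The dictionary and function names are assumptions made for the example.

```python
import math

# Total variances and scatter (residual) variances quoted in the text.
total_variance = {1: 1930.3155, 2: 2236.4883, 3: 8879.2866}
scatter_variance = {1: 353.3585, 2: 438.5672, 3: 5325.616}


def multiple_r(i):
    """R_i.jk = sqrt(1 - sigma^2_i.jk / sigma^2_i)."""
    return math.sqrt(1.0 - scatter_variance[i] / total_variance[i])


if __name__ == "__main__":
    for i in (1, 2, 3):
        print(f"R for X{i} on the other two variables: {multiple_r(i):.4f}")
```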

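An all-round check on the various calculations may be obtained
by using Eq. (31), Chap. XVI, as follows:

Table 55

$$1 = \beta^2_{ij.k} + \beta^2_{ik.j} + 2r_{jk}\beta_{ij.k}\beta_{ik.j} + \frac{\sigma^2_{i.jk}}{\sigma^2_i}$$

$$1 = \beta^2_{12.3} + \beta^2_{13.2} + 2r_{23}\beta_{12.3}\beta_{13.2} + \frac{\sigma^2_{1.23}}{\sigma^2_1}$$

$$= (0.80673)^2 + (0.14997)^2 + 2(0.59373)(0.80673)(0.14997) + \frac{\sigma^2_{1.23}}{\sigma^2_1}$$

$$1.0000 = 0.65081 + 0.02249 + 0.14367 + 0.18304$$

$$1 = \beta^2_{23.1} + \beta^2_{21.3} + 2r_{13}\beta_{23.1}\beta_{21.3} + \frac{\sigma^2_{2.13}}{\sigma^2_2}$$

$$= (0.05021)^2 + (0.86417)^2 + 2(0.62894)(0.05021)(0.86417) + \frac{\sigma^2_{2.13}}{\sigma^2_2}$$

$$1.0000 = 0.00252 + 0.74679 + 0.05458 + 0.19608$$

$$1 = \beta^2_{32.1} + \beta^2_{31.2} + 2r_{12}\beta_{32.1}\beta_{31.2} + \frac{\sigma^2_{3.12}}{\sigma^2_3}$$

$$= (0.15358)^2 + (0.49135)^2 + 2(0.89576)(0.15358)(0.49135) + \frac{\sigma^2_{3.12}}{\sigma^2_3}$$

$$1.0000 = 0.02359 + 0.24142 + 0.13519 + 0.59978$$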

Interpretation of Results Illustrated. The interpretation of
the above statistics of a trivariate frequency distribution may be
illustrated by assuming that it is desired to predict $X_1$, the
second-semester English grades of freshmen at the woman's college
selected. From the equation for the plane of regression of $X_1$
on $X_2$ and $X_3$, namely, $X'_1 = 28.42 + 0.75X_2 + 0.07X_3$,
estimates may be made of a freshman's grade in second-semester
English if her grades in the verbal scholastic-aptitude test and
in first-semester English are known.

Estimates Based on Regression Equation. If a freshman's
grade in first-semester English were 300 and her grade in the
verbal scholastic-aptitude test were 600, her second-term English
grade would be estimated at

$$X'_1 = 28.42 + 0.75(300) + 0.07(600) = 28.42 + 225 + 42 = 295$$

Since the second-term English grade will, of course, be affected
by other factors, the student's actual grade in second-semester
English will deviate from estimates based upon the regression
equation. This raises the question as to how much, on the
average, it can be expected that estimates based on the regression
equation will deviate from the actual values. The answer is
found by the determination of the value of $\sigma_{1.23}$, which has been
found above to be 18.8, or approximately 19. The standard
deviation of the differences between the actual grades and
estimates based on the above regression equation is therefore about
19. If this regression equation and first-order standard
deviation are typical of these college grades and if the differences
between actual and estimated values are in general normally
distributed, the chances are about 95 out of 100 that the actual value in any
particular case will fall within limits ±38 (= $2\sigma_{1.23}$) from the
estimated value.

The foregoing conclusion, which is based on the value of $\sigma_{1.23}$,
can be summarized very succinctly by the calculation of $R_{1.23}$,
which has been found to be equal to 0.9039. This is a fairly
high coefficient of multiple correlation. It shows that the above
plane of regression is a good fit, and therefore estimates based
upon it can be expected to be fairly good.
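The estimate and its limits of two standard deviations can be sketched as follows. The function names are hypothetical, and the coverage figure relies on the normality assumption stated in the text. The coefficients and the scatter of 18.7975 are the values quoted above.

```python
from statistics import NormalDist

SIGMA_1_23 = 18.7975  # scatter about the plane of regression, from the text


def estimate_x1(first_semester_english, verbal_aptitude):
    """Plane of regression of X1 on X2 and X3 as quoted in the text."""
    return 28.42 + 0.75 * first_semester_english + 0.07 * verbal_aptitude


def two_sigma_limits(estimate, sigma=SIGMA_1_23):
    """Limits of plus or minus two standard deviations about the estimate."""
    return estimate - 2.0 * sigma, estimate + 2.0 * sigma


estimate = estimate_x1(300, 600)
low, high = two_sigma_limits(estimate)
# Share of a normal distribution lying within two standard deviations.
coverage = NormalDist().cdf(2.0) - NormalDist().cdf(-2.0)

if __name__ == "__main__":
    print(f"estimate = {estimate:.0f}, limits = {low:.0f} to {high:.0f}")
    print(f"normal coverage within 2 sigma = {coverage:.3f}")
```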

Partial-correlation Coefficients. Since both $b$ statistics in the
equation of regression are positive, it is known that the net
correlations between $X_1$ and $X_2$ and between $X_1$ and $X_3$ are
positive. The amount of the net correlation is given by the
coefficients of partial correlation $r_{12.3} = 0.83496$ and $r_{13.2} = 0.27146$.
These show that second-semester English grades are much more
closely related to first-semester English grades than they are to
verbal scholastic-aptitude test grades.

Analysis of Variance in $X_1$. From the $\beta^2$'s and the $\beta$ cross
products, analysis of the variance in second-semester English
grades can be made. Thus, from the first set of $\beta^2$'s and cross
products in Table 55, it is seen that 65.1 per cent of the variance
in second-semester English grades, $X_1$, is accounted for by direct
association with first-semester English grades. Only 2.2 per
cent is accounted for by direct association with verbal
scholastic-aptitude test grades, although 14.4 per cent of the variance
in second-semester English is accounted for by indirect
association with both first-semester grades and verbal
scholastic-aptitude test grades. The variation in other influences accounts
for 18.3 per cent of the variance in second-semester English
grades.

[Table 56: work sheet not recoverable from the scan; surviving legend fragment: second-semester English grade, $X_4$ = College Board English examination.]

Under conditions existing at the woman's college studied, it
appears to be an inevitable conclusion that knowledge of grades 
in verbal scholastic-aptitude tests is not so helpful as might be 
supposed in predicting the subsequent performance of college 
freshmen students. 

Extension of Analysis to Include Four Variables. Additional
Zero-order Statistics. The extension of the trivariate frequency
distribution to include a fourth variable $X_4$ requires first the
calculation of the mean and standard deviation of the added
variable. It requires also the calculation of the simple
correlation coefficients between the new variable and each of the
other three. For illustration, the fourth variable taken is the
grade in the College Board English examination. Tables 56 to
58 are the usual work sheets for a correlation problem. From
them the necessary data are obtained for calculating the
additional zero-order statistics, as follows:

$$\bar{X}_4 = 549.8889 \qquad r_{14} = 0.49106$$

$$N\sigma^2_4 = 501{,}201.9828 \qquad r_{24} = 0.48807$$

$$\sigma^2_4 = 6{,}187.6788 \qquad r_{34} = 0.31551$$

$$\sigma_4 = 78.6618$$

Additional First-order Statistics. Among four variables it is
possible to distinguish four different sets of trivariate frequency
distributions, each of which will have three planes of regression.
Accordingly, when four variables are involved the total number
of first-order $\beta$ statistics is 24, two for each plane of regression. Six
of these 24 were calculated in Table 54; the remaining
18 may be obtained by a similar procedure. Table 59 shows the
24 $\beta$'s for the illustrated four-variable problem, grouped according
to the four possible trivariate frequency distributions.
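The count of 24 can be confirmed by enumeration. The sketch below is illustrative only; the subscript strings follow the text's notation (for example `12.3` for $\beta_{12.3}$), while the variable names are assumptions.

```python
from itertools import combinations, permutations

variables = (1, 2, 3, 4)

# For every trio of variables, each ordered pair (i, j) gives one first-order
# beta, with the remaining member of the trio as the variable held constant.
subscripts = []
for trio in combinations(variables, 3):
    for i, j in permutations(trio, 2):
        k = (set(trio) - {i, j}).pop()
        subscripts.append(f"{i}{j}.{k}")

if __name__ == "__main__":
    print(len(subscripts))  # 4 trios x 3 planes x 2 betas per plane
    print(subscripts[:6])   # the six betas of the trio X1, X2, X3
```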

Each of the four trivariate frequency distributions could be
analyzed as illustrated in the preceding sections of this chapter.
From the first-order $\beta$'s shown in Table 59 all the other
first-order statistics may be obtained, by methods already explained.

Table 59. The First-order $\beta$'s in the Four Trivariate Frequency Distributions for Four Variables
Data on four kinds of grades of 81 college freshmen, at the Selected Woman's College

Trivariate distribution    First plane                  Second plane                 Third plane

$X_1$, $X_2$, $X_3$        $\beta_{12.3} = 0.80673$     $\beta_{21.3} = 0.86417$     $\beta_{31.2} = 0.49135$
                           $\beta_{13.2} = 0.14997$     $\beta_{23.1} = 0.05021$     $\beta_{32.1} = 0.15358$

$X_1$, $X_2$, $X_4$        $\beta_{12.4} = 0.86124$     $\beta_{21.4} = 0.86457$     $\beta_{41.2} = 0.27260$
                           $\beta_{14.2} = 0.07071$     $\beta_{24.1} = 0.06352$     $\beta_{42.1} = 0.24392$

$X_1$, $X_3$, $X_4$        $\beta_{13.4} = 0.52642$     $\beta_{31.4} = 0.62463$     $\beta_{41.3} = 0.48413$
                           $\beta_{14.3} = 0.32498$     $\beta_{34.1} = 0.00877$     $\beta_{43.1} = 0.01101$

$X_2$, $X_3$, $X_4$        $\beta_{23.4} = 0.48837$     $\beta_{32.4} = 0.57724$     $\beta_{42.3} = 0.46447$
                           $\beta_{24.3} = 0.33400$     $\beta_{34.2} = 0.03378$     $\beta_{43.2} = 0.03974$

In few problems is it necessary or even desirable to calculate
all 24 first-order statistics of the four trivariate frequency
distributions involved in a four-variable set. As may be seen from
an examination of Table 60, it is possible to calculate all the
second-order statistics if only 18 of the 24 first-order $\beta$ statistics
are known. If only one of the four planes of regression in the
four-variable correlation problem is significant or important, it is
necessary to calculate only 8 of the first-order $\beta$ statistics.

Second-order Statistics in a Four-variable Problem. In the
four-variable correlation problem, statistics for four planes of
regression may be obtained. Following are the four possible
regression equations:

$$X'_1 = a_{1.234} + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4$$

$$X'_2 = a_{2.134} + b_{21.34}X_1 + b_{23.14}X_3 + b_{24.13}X_4$$

$$X'_3 = a_{3.124} + b_{31.24}X_1 + b_{32.14}X_2 + b_{34.12}X_4$$

$$X'_4 = a_{4.123} + b_{41.23}X_1 + b_{42.13}X_2 + b_{43.12}X_3$$

Also, for each plane of regression a scatter and a coefficient of
multiple correlation may be calculated. The procedure is
similar to that already illustrated; that is to say, the second-order
$\beta$'s are first obtained, and from them all the other second-order
statistics are calculated. Table 60 illustrates the procedure for
making the necessary calculations to obtain the 12 possible
second-order $\beta$ statistics.

Calculation of Second-order Statistics. In a problem where the
first-order partial coefficients of correlation are already calculated,
it is advisable to modify the formula for finding second-order $\beta$
statistics from first-order statistics as follows:

According to Eq. (39), Chap. XVI, it was found that

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\beta_{nj.k}}{1 - \beta_{nj.k}\beta_{jn.k}}$$

But from Eq. (24), Chap. XVI, it is known that

$$r^2_{jn.k} = \beta_{nj.k}\beta_{jn.k}$$

Accordingly, the formula for finding the second-order $\beta$ statistics
can be modified as follows:

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\beta_{nj.k}}{1 - r^2_{jn.k}}$$

In order to secure the greatest convenience in calculation, the
arrangement of the items in the work sheet (Table 60) is
according to the terms of this formula. First the desired subscript for
the $\beta$ statistic to be calculated is entered in column (5); then,
following the formula, the order in which the required trio of
first-order $\beta$'s appear in column (1) is determined. If this order
is followed, the entry in column (2) is the product of the second
two $\beta$'s of the trio in column (1); the entry in column (3) is
found by subtracting the entry in column (2) from the first $\beta$ of
the trio in column (1); the subscript of the third $\beta$ of the trio in
column (1) is the subscript of the partial $r$ for which $1 - r^2$ is to
be found in appropriate tables or, if preferred, calculated. The
desired second-order $\beta$'s are then calculated, by dividing the
entry in column (3) by the entry in column (4), and entered in
column (5).
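One row of the work sheet can be sketched in code. This is a minimal illustration, assuming the denominator is formed from the product $\beta_{nj.k}\beta_{jn.k}$ (which equals $r^2_{jn.k}$) rather than read from a table of $1 - r^2$; the function and argument names are hypothetical.

```python
def second_order_beta(beta_ij_k, beta_in_k, beta_nj_k, beta_jn_k):
    """beta_ij.kn = (beta_ij.k - beta_in.k * beta_nj.k) / (1 - beta_nj.k * beta_jn.k)."""
    numerator = beta_ij_k - beta_in_k * beta_nj_k
    denominator = 1.0 - beta_nj_k * beta_jn_k  # equals 1 - r^2_jn.k
    return numerator / denominator


if __name__ == "__main__":
    # beta_12.34 from beta_12.3, beta_14.3, beta_42.3 and beta_24.3.
    print(round(second_order_beta(0.80673, 0.32498, 0.46447, 0.33400), 5))
    # beta_34.21 from beta_34.2, beta_31.2, beta_14.2 and beta_41.2.
    print(round(second_order_beta(0.03378, 0.49135, 0.07071, 0.27260), 5))
```

Passing the four first-order $\beta$'s in the order in which they occur in the formula keeps the call parallel to the order of entries in column (1) of the work sheet.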

Table 60. Calculation of the Second-order $\beta$'s from the First-order $\beta$'s

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\beta_{nj.k}}{1 - r^2_{jn.k}}$$

in which

$$r^2_{jn.k} = \beta_{nj.k}\beta_{jn.k}$$

[See Chap. XVI, Eqs. (24) and (39)]

Columns: (1) first-order $\beta$ (subscript and regression statistic); (2) product term of numerator; (3) whole numerator; (4) $1 - r^2_{jn.k}$; (5) second-order $\beta$ (subscript and regression statistic).

(1) Subscript   (1) Statistic   (2)        (3)         (4)        (5) Subscript   (5) Statistic

12.3            0.80673         0.15094    0.65579     0.84487    12.34           0.77620
14.3            0.32498
42.3            0.46447

13.2            0.14997         0.00281    0.14716     0.99866    13.24           0.14736
14.2            0.07071
43.2            0.03974

14.2            0.07071         0.00507    0.06564     0.99866    14.23           0.06573
13.2            0.14997
34.2            0.03378

21.3            0.86417         0.16170    0.70247     0.84267    21.34           0.83362
24.3            0.33400
41.3            0.48413

23.1            0.05021         0.00070    0.04951     0.99990    23.14           0.04951
24.1            0.06352
43.1            0.01101

24.1            0.06352         0.00044    0.06308     0.99990    24.13           0.06309
23.1            0.05021
34.1            0.00877

31.2            0.49135         0.00921    0.48214     0.98072    31.24           0.49162
34.2            0.03378
41.2            0.27260

32.1            0.15358         0.00214    0.15144     0.98451    32.14           0.15382
34.1            0.00877
42.1            0.24392

34.2            0.03378         0.03474    -0.00096    0.98072    34.21           -0.00098
31.2            0.49135
14.2            0.07071

41.2            0.27260         0.01953    0.25307     0.92631    41.23           0.27320
43.2            0.03974
31.2            0.49135

42.1            0.24392         0.00169    0.24223     0.99229    42.13           0.24411
43.1            0.01101
32.1            0.15358

43.1            0.01101         0.01225    -0.00124    0.99229    43.12           -0.00125
42.1            0.24392
23.1            0.05021

In problems for which it is not desired to calculate the
first-order coefficients of partial correlation, the alternative method
illustrated in Table 61 may be used. It is to be noted that the
only differences are that an additional $\beta$ must be entered in
column (1) in each of the sets and that an additional column,
column (4), is required in which to enter the product term of the
denominator. The item in column (5) is then obtained by
taking the complement of the corresponding entry in column (4).
The second-order $\beta$ is found by dividing the entry in column (3)
by the entry in column (5). For convenience of arrangement,
the product term of the numerator is written in the order
$\beta_{in.k}\beta_{nj.k}$ rather than $\beta_{nj.k}\beta_{in.k}$, and the product term of the
denominator is arranged in the order $\beta_{nj.k}\beta_{jn.k}$ rather than
$\beta_{jn.k}\beta_{nj.k}$. Except for the convenience in arrangement of the
work sheet, the order in which such product terms occur is
immaterial; but, when arranged as indicated, once the subscript
of the desired second-order $\beta$ is entered in column (6), the order
in which the first-order $\beta$'s occur in the equation may be followed
in entering them in column (1). There are only four first-order
$\beta$'s in each set, for the third (in the numerator) is repeated in the
first part of the product term of the denominator. When this
procedure as to arrangement in the work sheet is followed, the
entry in column (2) is always the product of the two middle $\beta$'s
in the set of four in column (1), and the entry in column (4) is
always the product of the last two $\beta$'s entered in column (1).

Table 61. Calculation of the Second-order $\beta$'s from the First-order $\beta$'s
(Alternative method illustrated)¹

$$\beta_{ij.kn} = \frac{\beta_{ij.k} - \beta_{in.k}\beta_{nj.k}}{1 - \beta_{nj.k}\beta_{jn.k}}$$

Columns: (1) first-order $\beta$ (subscript and regression statistic); (2) product term of numerator; (3) whole numerator; (4) product term of denominator; (5) whole denominator; (6) second-order $\beta$ (subscript and regression statistic).

(1) Subscript   (1) Statistic   (2)        (3)        (4)         (5)        (6) Subscript   (6) Statistic

12.3            0.80673         0.15094    0.65579    0.155133    0.84487    12.34           0.77620
14.3            0.32498
42.3            0.46447
24.3            0.33400

13.2            0.14997         0.00281    0.14716    0.001342    0.99866    13.24           0.14736
14.2            0.07071
43.2            0.03974
34.2            0.03378

14.2            0.07071         0.00507    0.06564    0.001342    0.99866    14.23           0.06573
13.2            0.14997
34.2            0.03378
43.2            0.03974

¹ If this method is used, the $b$'s instead of the $\beta$'s could be first calculated, using a similar
table and the general formula

$$b_{ij.kn} = \frac{b_{ij.k} - b_{in.k}b_{nj.k}}{1 - b_{nj.k}b_{jn.k}}$$

The second-order coefficients of partial correlation are
calculated from the second-order $\beta$'s as follows:¹

$$r^2_{ij.k \ldots n} = \beta_{ij.k \ldots n}\,\beta_{ji.k \ldots n}$$

or, for the four-variable case,

$$r^2_{ij.kn} = \beta_{ij.kn}\,\beta_{ji.kn}$$

$$r^2_{12.34} = 0.77620(0.83362) = 0.647056 \qquad r_{12.34} = 0.80440$$

$$r^2_{13.24} = 0.14736(0.49162) = 0.072445 \qquad r_{13.24} = 0.26916$$

$$r^2_{14.23} = 0.06573(0.27320) = 0.017957 \qquad r_{14.23} = 0.13400$$

$$r^2_{24.13} = 0.06309(0.24411) = 0.015401 \qquad r_{24.13} = 0.12410$$

$$r^2_{23.14} = 0.04951(0.15382) = 0.007616 \qquad r_{23.14} = 0.08728$$

$$r^2_{34.12} = -0.00098(-0.00125) = 0.000001225 \qquad r_{34.12} = -0.00111$$

(The negative sign of the partial $r$ is determined by the negative
sign of the corresponding $\beta$ statistic.)
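The rule for the sign can be made explicit in a small sketch. The function name is hypothetical; the check that the two $\beta$'s never differ in sign follows the statement made earlier in the chapter about respective pairs of $\beta$'s.

```python
import math


def partial_r(beta_ij, beta_ji):
    """Partial r as the square root of the product of the two betas,
    carrying the sign shared by the pair."""
    product = beta_ij * beta_ji
    if product < 0:
        raise ValueError("the two betas of a pair should not differ in sign")
    return math.copysign(math.sqrt(product), beta_ij)


if __name__ == "__main__":
    print(round(partial_r(0.77620, 0.83362), 5))    # r_12.34
    print(round(partial_r(-0.00098, -0.00125), 5))  # r_34.12, negative
```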

¹ For checking or alternative formulas to find the partial coefficients of
correlation, see p. 447.

The $b$ statistics of the second order are calculated from the
second-order $\beta$'s in the same way as the first-order $b$'s from the
first-order $\beta$'s, by the formula

$$b_{ij.k \ldots n} = \beta_{ij.k \ldots n}\,\frac{\sigma_i}{\sigma_j}$$

or, for the four-variable problem,

$$b_{ij.kn} = \beta_{ij.kn}\,\frac{\sigma_i}{\sigma_j}$$

$$b_{12.34} = \beta_{12.34}\,\frac{\sigma_1}{\sigma_2} = 0.77620(0.92903) = 0.72111$$

$$b_{13.24} = 0.14736(0.46626) = 0.06871$$

$$b_{14.23} = 0.06573\left(\frac{43.9354}{78.6618}\right) = 0.06573(0.55854) = 0.03671$$

$$b_{21.34} = 0.83362(1.07639) = 0.89730$$

$$b_{23.14} = 0.04951(0.50187) = 0.02485$$

$$b_{24.13} = 0.06309\left(\frac{47.2915}{78.6618}\right) = 0.06309(0.60120) = 0.03793$$

$$b_{34.12} = -0.00098\left(\frac{94.2300}{78.6618}\right) = -0.00098(1.19791) = -0.00117$$

$$b_{31.24} = 0.49162(2.1447) = 1.05438$$

$$b_{32.14} = 0.15382(1.9925) = 0.30650$$

$$b_{41.23} = 0.27320\left(\frac{78.6618}{43.9354}\right) = 0.27320(1.79040) = 0.48914$$

$$b_{42.13} = 0.24411\left(\frac{78.6618}{47.2915}\right) = 0.24411(1.66334) = 0.40604$$

$$b_{43.12} = -0.00125\left(\frac{78.6618}{94.2300}\right) = -0.00125(0.83478) = -0.00104$$

It will be noted that, with the exception of those involving $\sigma_4$,
the standard-deviation ratios used in the above calculations
have all been computed and may be copied from the preceding
section, where the first-order $b$'s were calculated from the
first-order $\beta$'s.

The second-order $a$ statistics are calculated as follows:

$$a_{i.jkn} = \bar{X}_i - b_{ij.kn}\bar{X}_j - b_{ik.jn}\bar{X}_k - b_{in.jk}\bar{X}_n$$

$$a_{1.234} = 217.4074 - 0.72111(204.074) - 0.06871(515.4816) - 0.03671(549.8889) = 14.64244$$

$$a_{2.134} = 204.074 - 0.89730(217.4074) - 0.02485(515.4816) - 0.03793(549.8889) = -24.67266$$

$$a_{3.124} = 515.4816 - 1.05438(217.4074) - 0.30650(204.074) + 0.00117(549.8889) = 224.34628$$

$$a_{4.231} = 549.8889 - 0.40604(204.074) + 0.00104(515.4816) - 0.48914(217.4074) = 361.22013$$

The equations for the four planes of regression may now be
written as follows:

$$X'_1 = 14.64 + 0.721X_2 + 0.069X_3 + 0.037X_4$$

$$X'_2 = -24.67 + 0.897X_1 + 0.025X_3 + 0.038X_4$$

$$X'_3 = 224.35 + 1.05X_1 + 0.306X_2 - 0.0012X_4$$

$$X'_4 = 361.22 + 0.489X_1 + 0.406X_2 - 0.001X_3$$

If $X_1$ is considered the dependent variable, it can be estimated
from the first equation; if $X_2$ is considered the dependent variable,
it can be estimated from the second equation; if $X_3$ is considered
the dependent variable, it can be estimated from the third
equation; if $X_4$ is considered the dependent variable, it can be
estimated from the fourth equation. The standard errors of
estimate, that is, the scatters, respectively, about the four
planes of regression may also be calculated.¹

$$\sigma^2_{i.jkn} = \sigma^2_{i.jk}(1 - r^2_{in.jk})$$

$$\sigma^2_{1.234} = \sigma^2_{1.23}(1 - r^2_{14.23}) = 353.34(0.98204) = 346.9940 \qquad \sigma_{1.234} = 18.628$$

$$\sigma^2_{2.134} = \sigma^2_{2.13}(1 - r^2_{24.13}) = 438.53(0.98460) = 431.7766 \qquad \sigma_{2.134} = 20.779$$

$$\sigma^2_{3.124} = \sigma^2_{3.12}(1 - r^2_{34.12}) = 5{,}325.56(1.00000) = 5{,}325.56 \qquad \sigma_{3.124} = 72.976$$

$$\sigma^2_{4.123} = \sigma^2_{4.12}(1 - r^2_{43.12}) = 4{,}622.8210(1.00000) = 4{,}622.8210 \qquad \sigma_{4.123} = 67.991$$

¹ For alternative methods, see p. 415 and Eq. (19), Chap. XVI.

The multiple-correlation coefficients, which measure the
goodness of fit of the planes of regression, are calculated in the same
way as for the trivariate problem, namely,¹

$$R^2_{i.jkn} = 1 - \frac{\sigma^2_{i.jkn}}{\sigma^2_i}$$

$$R^2_{1.234} = 1 - \frac{346.994}{1{,}930.3155} = 1 - 0.17976 = 0.8202 \qquad R_{1.234} = 0.9056$$

$$R^2_{2.134} = 1 - \frac{431.7766}{2{,}236.4883} = 1 - 0.19306 = 0.8069 \qquad R_{2.134} = 0.8983$$

$$R^2_{3.124} = 1 - \frac{5{,}325.56}{8{,}879.2866} = 1 - 0.5998 = 0.4002 \qquad R_{3.124} = 0.6326$$

$$R^2_{4.123} = 1 - \frac{4{,}622.8210}{6{,}187.6788} = 1 - 0.74710 = 0.2529 \qquad R_{4.123} = 0.5029$$

For the four-variable problem, the equation for the $\beta$ squares
and $\beta$ cross products is as follows:

$$\beta^2_{ij.kn} + \beta^2_{ik.jn} + \beta^2_{in.jk} + 2r_{jk}\beta_{ij.kn}\beta_{ik.jn} + 2r_{jn}\beta_{ij.kn}\beta_{in.jk} + 2r_{kn}\beta_{ik.jn}\beta_{in.jk} + \frac{\sigma^2_{i.jkn}}{\sigma^2_i} = 1$$


In Table 62 some of these checks are illustrated. 

¹ For an alternative method, see Eq. (20), Chap. XVI.

Interpretation of Results Illustrated. The interpretation of
the above statistics of a four-variable frequency distribution
may be illustrated by assuming that it is desired to predict the
second-semester English grades of freshmen at the woman's
college selected; in other words, $X_1$ is assumed to be the
dependent variable. From the equation for the plane of
regression of $X_1$ on $X_2$, $X_3$, and $X_4$, namely,

$$X'_1 = 14.64 + 0.721X_2 + 0.069X_3 + 0.037X_4$$

estimates may be made of a freshman's grade in second-semester
English if her grades in the verbal scholastic-aptitude test, in
College Board English, and in the first-semester freshman
English course are known.

Estimates Based on Regression Equation. If a freshman’s 
grade in first-semester English is 300, in the verbal scholastic- 
aptitude test 600, and in College Board English 500, her second- 
semester English grade is estimated as follows: 

$$X'_1 = 14.64 + 0.721(300) + 0.069(600) + 0.037(500) = 14.64 + 216.3 + 41.4 + 18.5 = 291$$

Since the second-semester English grade will, of course, be
affected by other factors, the student's actual grade in
second-semester English will deviate from estimates based upon the
regression equation. This raises the question as to how much
on the average it can be expected that estimates based on the
regression equation will deviate from the actual values. The
answer is found by the determination of the value of $\sigma_{1.234}$,
which has been found above to be 18.6, or approximately 19.
The standard deviation of the differences between the actual
and the estimated grades in second-semester English is therefore
about 19. If this regression equation and second-order standard
deviation are typical of these college grades and if the differences
between actual and estimated values are in general normally
distributed, the chances are about 95 out of 100 that the actual value in a
particular case will be within limits ±38 (= $2\sigma_{1.234}$) from the
estimated value.

The foregoing conclusion, which is based on the value of
$\sigma_{1.234}$, can be summarized very succinctly by the calculation of
$R_{1.234}$, which has been found to be equal to 0.9056.

If this result is compared with the estimate based on only two
independent variables, it is found that the standard error of
estimate is almost as large for the plane based on three
independent variables as the standard error of estimate based on two
independent variables.¹ In other words, very little increase in
accuracy was obtained by including the fourth variable in
the correlation problem. This same conclusion is borne out
by comparing the coefficients of multiple correlation. Thus
$R_{1.234} = 0.9056$, while $R_{1.23} = 0.9039$, which is nearly as large,
indicating that the trivariate plane was nearly as good a fit as
the four-variable plane of regression.

Partial-correlation Coefficients. The unimportance of
knowledge of grades in College Board English examinations in
predicting the grades of freshmen in second-semester English is
explained also by the small partial-correlation coefficient between
$X_1$ and $X_4$ when $X_2$ and $X_3$ are held constant. This
partial-correlation coefficient is given as $r_{14.23} = 0.1340$.

Analysis of Variance in $X_1$. These conclusions are further
indicated by the nature of the $\beta$ squares and the $\beta$ cross-product
terms. From the first equation in Table 62 it is seen that the
various proportions of variance in Xi are accounted for as 
follows: 

60.25 per cent by correlation with first-semester English grades. 

2.17 per cent by correlation with verbal scholastic-aptitude 
tests. 

0.43 per cent by correlation with College Board English exami¬ 
nations. 

13.58 per cent by indirect correlation with first-semester English 
grades and verbal scholastic-aptitude tests. 

4.98 per cent by indirect correlation with first-semester English 
grades and College Board English examinations. 

0.61 per cent by indirect correlation with verbal scholastic- 
aptitude tests and College Board English examinations. 
17.98 per cent by correlation with other factors independent of 
first-semester English grades, verbal scholastic-aptitude 
test grades, and College Board English examinations. 
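The seven proportions listed above can be retraced from the second-order $\beta$'s and the simple correlation coefficients among the three independent variables. The sketch below is illustrative: the dictionary layout is an assumption, while the numerical inputs are the values quoted in this chapter.

```python
from itertools import combinations

# Second-order betas of X1 on X2, X3, X4, keyed by the independent variable.
beta = {2: 0.77620, 3: 0.14736, 4: 0.06573}
# Simple correlation coefficients among the independent variables.
r = {(2, 3): 0.59373, (2, 4): 0.48807, (3, 4): 0.31551}

# Beta squares: variance accounted for by direct correlation.
direct = {j: beta[j] ** 2 for j in beta}
# Beta cross products: variance accounted for by indirect correlation.
indirect = {(j, k): 2.0 * r[(j, k)] * beta[j] * beta[k]
            for j, k in combinations(sorted(beta), 2)}
# Remainder: variance due to other factors.
unexplained = 1.0 - sum(direct.values()) - sum(indirect.values())

if __name__ == "__main__":
    for j, share in direct.items():
        print(f"direct, X{j}: {100 * share:.2f} per cent")
    for (j, k), share in indirect.items():
        print(f"indirect, X{j} and X{k}: {100 * share:.2f} per cent")
    print(f"other factors: {100 * unexplained:.2f} per cent")
```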

¹ Cf. p. 451.

The small percentages attributable to College Board English
examination grades, either directly or indirectly, are apparent
from these statistics. Evidently, under conditions existing at
the woman's college, grades on the College Board English
examination were of little value for predicting how well the
students would do in their college freshman English courses.¹
Another approach to the study of variance in $X_1$ could be
made as follows: It was noted above for three variables that²

$$\sigma^2_1 = r^2_{12}\sigma^2_1 + r^2_{13.2}\sigma^2_{1.2} + \sigma^2_{1.23}$$

For four variables,

$$\sigma^2_1 = r^2_{12}\sigma^2_1 + r^2_{13.2}\sigma^2_{1.2} + r^2_{14.23}\sigma^2_{1.23} + \sigma^2_{1.234}$$

which may be expressed in proportions as follows:

$$1 = r^2_{12} + r^2_{13.2}\frac{\sigma^2_{1.2}}{\sigma^2_1} + r^2_{14.23}\frac{\sigma^2_{1.23}}{\sigma^2_1} + \frac{\sigma^2_{1.234}}{\sigma^2_1}$$

This expression means that the total variance in $X_1$ is
composed of four parts as follows: the part that is due to total simple
linear correlation with $X_2$, the part that is due to partial
correlation with $X_3$ when $X_2$ is held constant, the part that is due to
partial correlation with $X_4$ when $X_2$ and $X_3$ are held constant,
and the part due to other causes independent of $X_2$, $X_3$, and $X_4$.
The expression $r^2_{13.2}\,\sigma^2_{1.2}/\sigma^2_1$ describes the proportion of the variance
in $X_1$ that is explained as a result of adding $X_3$ to the regression
equation, while $r^2_{14.23}\,\sigma^2_{1.23}/\sigma^2_1$ describes the proportion of the variance
in $X_1$ that is explained as a result of adding $X_4$ to the regression
equation; the influences of $X_3$ and $X_4$ that result from their
association with $X_2$ are already contained in $r^2_{12}$. By
substituting the values of the four above terms in the illustrated
problem, it becomes


$$1.00000 = 0.80238 + 0.07369\,\frac{\sigma^2_{1.2}}{1{,}930.3155} + 0.017957\,\frac{353.358}{1{,}930.3155} + \frac{346.9940}{1{,}930.3155}$$

or

$$1.00000 = 0.80238 + 0.01456 + 0.00329 + 0.17977$$

From this expression it may be said that 80.2 per cent of the
variance in $X_1$ is accounted for by total correlation with $X_2$, a
further 1.4 per cent is accounted for by additional correlation
with $X_3$, and a further 0.3 per cent is accounted for by additional
correlation with $X_4$, the remaining 18 per cent being due to other
influences independent of $X_2$, $X_3$, and $X_4$. In other words, by
making a four-, instead of a three-variable correlation problem,
that is, by including the College Board English examination
grades, only an additional 0.3 per cent of the variance in
second-semester English grades is explained.

¹ It will be noted, however, that $r_{14} = 0.49$ so that approximately 25 per
cent [= $(0.49)^2$] of the variation in $X_1$ may be estimated from knowledge of $X_4$
alone.

² Cf. Chap. XVI, Eq. (33), p. 428.
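The successive decomposition of the variance in $X_1$ can be retraced numerically. This sketch uses the values quoted in the chapter; the variable names are assumptions, and the ratio $\sigma^2_{1.2}/\sigma^2_1$ is taken as $1 - r^2_{12}$, which follows from the definition of the scatter about a line of regression.

```python
var_1 = 1930.3155       # total variance of X1
var_1_23 = 353.358      # scatter variance about the three-variable plane
var_1_234 = 346.9940    # scatter variance about the four-variable plane
r12_sq = 0.80238        # squared simple correlation of X1 with X2
r13_2_sq = 0.07369      # squared partial correlation of X1 with X3, X2 constant
r14_23_sq = 0.017957    # squared partial correlation of X1 with X4, X2 and X3 constant

shares = {
    "X2, total correlation": r12_sq,
    "X3 added": r13_2_sq * (1.0 - r12_sq),  # sigma^2_1.2 / sigma^2_1 = 1 - r12^2
    "X4 added": r14_23_sq * var_1_23 / var_1,
    "other influences": var_1_234 / var_1,
}

if __name__ == "__main__":
    for label, share in shares.items():
        print(f"{label}: {100 * share:.1f} per cent")
    print(f"total: {sum(shares.values()):.5f}")
```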



CHAPTER XVIII 


NORMAL FREQUENCY SURFACE 


THE BIVARIATE HISTOGRAM 

The study of frequency surfaces begins logically with a
geometrical representation of a bivariate frequency distribution
known as a "bivariate histogram." To visualize the histogram
that would represent the distribution of Table 25 (page …),
consider an ordinary checkerboard. Let the sides of
the board be calibrated with the class-interval scale shown in
Table 25, and let 81 checkers be taken to represent the
students. On the checkerboard square in the row headed 60-
and the column headed 120-, let one checker be placed; on the
square in the row headed 100- and the column headed …, let
two checkers be placed; on the square in the row headed 100- and
the column headed 100-, let one checker be placed; and so on,
until all the squares on the checkerboard for which there are
frequencies in Table 25 are covered with the proper number of
checkers piled on top of each other.

If the checkers were square rather than round, they would
stand up better and fill in all the area, helping to support each
other. If they were square, the resulting figure would resemble
a histogram for the given bivariate frequency distribution. A
picture of what such a histogram would look like is given in
Fig. 121.

In the foregoing example the heights of the various piles of
checkers represented the frequency of each cell. It would be
possible, however, so to adjust the vertical scale that the heights
of the piles of checkers represented the relative frequency of
each cell. If the checkers were square, giving a histogram
proper, then, further, it would be possible to adjust the vertical
scale so that the volume of each pile of square checkers measured
the relative frequencies. For example, since the class intervals
are 20 units each and the area of any cell is thus 400 square
units, the height of a pile of checkers taken to measure a
relative frequency of, say, 0.08 would be 0.0002 unit. This would,
of course, be very small; but then, in any model, the vertical
unit could be taken sufficiently large to offset this. That is,
instead of letting 1/4 inch represent 1 unit (the thickness of one
checker, say), it would be possible to let 10,000 inches represent
1 unit. Then 0.0002 unit would be the equivalent of a pile of
eight checkers.
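The arithmetic of the vertical scale can be sketched as follows. The checker thickness of one-quarter inch is an assumption made here so that the result of eight checkers is reproduced; the class interval, the relative frequency, and the scale of 10,000 inches per unit come from the passage.

```python
CLASS_INTERVAL = 20               # width of each class interval, from the text
INCHES_PER_UNIT = 10_000          # vertical scale suggested in the text
CHECKER_THICKNESS_INCHES = 0.25   # assumed thickness of one checker


def pile_height_units(relative_frequency,
                      interval_x=CLASS_INTERVAL,
                      interval_y=CLASS_INTERVAL):
    """Height such that height times cell area equals the relative frequency."""
    return relative_frequency / (interval_x * interval_y)


height = pile_height_units(0.08)
inches = height * INCHES_PER_UNIT
checkers = inches / CHECKER_THICKNESS_INCHES

if __name__ == "__main__":
    print(f"height = {height:.4f} unit, {inches:.1f} inches, {checkers:.0f} checkers")
```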




Suppose that a histogram is constructed so that the volumes
of the square checkers erected on each cell represent the relative
frequency of that cell, and suppose that the number of cases is
indefinitely increased and at the same time the size of the class
intervals is made infinitesimally small. The result would be a
figure the top of which would tend to trace out a smooth
surface. This would be a frequency surface. A frequency
surface is thus the limit approached by a bivariate histogram as the
number of cases is indefinitely increased and the size of the class
intervals indefinitely reduced. If an area is traced out in the
$X_1X_2$ plane, the relative frequency of cases falling in this area is
given by the volume under the surface over that area.

FREQUENCY SURFACES 

Frequency surfaces may assume all sorts of shapes. They may
be symmetrical and bell-shaped, or they may be distorted by
skewness or excessive peakedness or flatness, depending on the
types of forces underlying the variation in the two variables.
First will be considered the case of a bivariate surface for variables
that are normally distributed and are independent of each
other.
Bivariate-surface, Independent Variables. A monovariate
frequency distribution, it will be recalled, showed the relative
frequency of occurrence of various values of a given variable. A
joint, or bivariate, frequency distribution shows the relative
frequency of occurrence of various pairs of values of the two
given variables. Suppose, for example, that a marksman is
shooting at a target. The scatter of dots about the center of
Fig. 122 may be taken to illustrate the results of a large number
of such shots. The position of any particular shot relative to
the center of the target may be indicated by the amount of its
horizontal deflection (call it $X_2$) and by the amount of its vertical
deflection (call it $X_1$). The relative frequencies of various types
of shots may consequently be indicated by the relative
frequencies of various combinations of horizontal and vertical
deflections, that is to say, of various pairs of values of $X_1$ and $X_2$.

The relative frequency of shots in any given area of the target,
the $X_1X_2$ plane shown in Fig. 122, may be indicated by the density
of shots in that area or by the volume of some frequency surface
constructed over the $X_1X_2$ plane. The use of the surface for this
purpose is illustrated in Fig. 123.
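The two ways of indicating relative frequency, by density of shots and by volume under a surface, can be compared in a small simulation. Everything specific here is an assumption for illustration: the deflections are drawn as independent normal variables with unit standard deviation, the seed and the chosen area are arbitrary, and the function names are hypothetical.

```python
import random
from statistics import NormalDist


def simulate_shots(n, sd_vertical=1.0, sd_horizontal=1.0, seed=0):
    """Draw n shots as (vertical X1, horizontal X2) independent normal deflections."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, sd_vertical), rng.gauss(0.0, sd_horizontal))
            for _ in range(n)]


def relative_frequency(shots, x1_range, x2_range):
    """Proportion of shots falling in the rectangle x1_range by x2_range."""
    hits = sum(1 for x1, x2 in shots
               if x1_range[0] <= x1 < x1_range[1]
               and x2_range[0] <= x2 < x2_range[1])
    return hits / len(shots)


def volume_under_surface(x1_range, x2_range, sd_vertical=1.0, sd_horizontal=1.0):
    """Volume over the rectangle under the independent normal surface,
    which factors into the product of two normal-curve areas."""
    v = NormalDist(0.0, sd_vertical)
    h = NormalDist(0.0, sd_horizontal)
    return ((v.cdf(x1_range[1]) - v.cdf(x1_range[0]))
            * (h.cdf(x2_range[1]) - h.cdf(x2_range[0])))


shots = simulate_shots(20_000)
observed = relative_frequency(shots, (0.0, 1.0), (-1.0, 1.0))
expected = volume_under_surface((0.0, 1.0), (-1.0, 1.0))

if __name__ == "__main__":
    print(f"observed share of shots = {observed:.3f}")
    print(f"volume under surface    = {expected:.3f}")
```

With a large number of shots the observed share should lie close to the volume, which is the sense in which the surface is the limit of the histogram.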

It will be noted that the shots tend to be distributed
symmetrically around the center of the target. No tendency for
large vertical deviations to be associated with large horizontal
deviations in either a positive or a negative direction is evident.
Also, no tendency for vertical deviations to vary in any
particular way with horizontal deviations is apparent.

Fig. 122. Distribution of shots at a target, representing a symmetrical
bivariate distribution.

Normal Bivariate-surface, Independent Variables. Figure 123
illustrates the normal bivariate surface for independent variables; it
and the example described above illustrate the characteristics
of a bivariate distribution where there is no correlation between
the two variables. This may be summarized as follows: There
is no correlation, that is to say, the variables are independent of
each other, because (1) for any given value of X1, the distribution
of values of X2 is the same, with the same mean and standard
deviation, as for any other value of X1; (2) for any given value
of X2, the distribution of values of X1 is the same, with the same
mean and standard deviation, as for any other value of X2.
When each variable is the result of a set of forces that will produce
a normal frequency distribution in that variable alone and when
the two sets of forces operate independently of each other, the
result will be a normal bivariate frequency distribution with no
correlation. The easiest way of generating a normal bivariate
frequency surface is to suppose that a form of the normal frequency
curve is held in a position perpendicular to the base plane, as in
Fig. 124. A knob is fixed to the top of the frequency curve at
B, and the center of the base line of the frequency curve is
fixed at A, so that it can revolve but so that the line BA always
remains perpendicular to the base plane CD.

Fig. 124. The normal curve, revolution of which will produce the
surface.

If the form of this normal frequency curve is revolved in a
complete cycle until it reaches its original position again, the
frequency curve will "describe" the surface of the bivariate
normal frequency surface for independent variables, and it will
be like a system of symmetrically concentric circles such as
that shown in Fig. 123. For such a distribution of pairs of
observations X1 and X2, r = 0, for the x1x2 products are
distributed equally in the four quadrants, minus products canceling
plus products.
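The cancellation of plus and minus products can be checked numerically. The following is only a sketch under assumed conditions: the deflections are simulated as two independent normal variates with equal standard deviations, and the names `simulate_shots` and `product_moment_r`, the sample size, and the seed are illustrative choices, not part of the text.

```python
import math
import random

def product_moment_r(pairs):
    """Product-moment correlation coefficient r for a list of (X1, X2) pairs."""
    n = len(pairs)
    mean1 = sum(p[0] for p in pairs) / n
    mean2 = sum(p[1] for p in pairs) / n
    # Deviations from the means play the role of the small x's of the text.
    sum_products = sum((a - mean1) * (b - mean2) for a, b in pairs)
    sum_sq1 = sum((a - mean1) ** 2 for a, _ in pairs)
    sum_sq2 = sum((b - mean2) ** 2 for _, b in pairs)
    return sum_products / math.sqrt(sum_sq1 * sum_sq2)

def simulate_shots(n, sigma1=1.0, sigma2=1.0, seed=0):
    """Hypothetical marksman: vertical and horizontal deflections drawn independently."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, sigma1), rng.gauss(0.0, sigma2)) for _ in range(n)]

shots = simulate_shots(20000)
r = product_moment_r(shots)
print(f"r for independent deflections = {r:.4f}")
```

With a large number of simulated shots the computed r should lie close to zero, though a finite sample will rarely give exactly zero.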

Mathematical Representation of Normal Bivariate Surface,
Independent Variables. Means of X1 and X2 as the Origin. As
noted in the discussion of Fig. 122, the various X1X2 points
plotted in a bivariate plane may, with no difficulty, be described
in terms of their distances from the respective means. This has
the effect of shifting the axes so that the new axes are the lines
drawn perpendicular to the respective scales at their means; the
vertical line drawn through X̄2 in Fig. 122 is the x1-axis, and the
horizontal line drawn through X̄1 in Fig. 122 is the x2-axis.
Vertical and horizontal deviations from the center of the circle
are x1 and x2 variates. For many purposes it is more convenient
to use this method of describing points in a bivariate plane than
to use the original scales as the point of reference. In the
following pages, the more frequent appearance of x1 and x2, instead of
the capital letters, will be understood to signify the shift from
reference to the original axes to reference to the axes with the
origin at the means of the two variables.

Probability of Each Variate Taken Separately. If x1 is a
normally distributed variate above and below the X̄1 and is
completely independent of x2, the probability or relative frequency of
any value of x1 between x1 and x1 + dx1, whether associated
with large or with small values or with positive or negative
values of x2, will be given by

\[
dP(x_1) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{x_1^2}{2\sigma_1^2}}\, dx_1 \tag{1}
\]

Similarly, if x2 is a normally distributed variate and is completely
independent of x1, the probability or relative frequency of any
value of x2 between x2 and x2 + dx2, whether associated with
large or with small values of x1 or with positive or negative
values of x1, will be given by

\[
dP(x_2) = \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-\frac{x_2^2}{2\sigma_2^2}}\, dx_2 \tag{2}
\]


Joint Probability of Two Variables. The joint probability or
joint relative frequency of an x1 between x1 and x1 + dx1
occurring in association with an x2 between x2 and x2 + dx2 is the
product of the above two probabilities. In other words, the
joint probability of the two variables occurring in pairs of any
combination is given by

\[
dP(x_1 x_2) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{x_1^2}{2\sigma_1^2}} \cdot \frac{1}{\sigma_2\sqrt{2\pi}}\, e^{-\frac{x_2^2}{2\sigma_2^2}}\, dx_1\, dx_2 \tag{3}
\]

which reduces to the following form:

\[
dP(x_1 x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{4}
\]

Fig. 125. A normal bivariate frequency surface, independent variables.
[Here σ1 > σ2.]

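Geometrically, the dP(x1x2) expressed in Eq. (4) describes the
volume of a column with breadth and width of dx1 and dx2 and a
height equal to

\[
\frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}
\]

Such a column is shown at P in Fig. 125.

The normal bivariate surface may be described, therefore,
as follows:

\[
f(x_1 x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)} \tag{5}
\]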
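The statement that the height of the surface is the product of the two separate normal ordinates can be verified with a few lines of code. This is a minimal sketch; the function names and the particular values of σ1, σ2, x1, and x2 are assumptions chosen only for illustration.

```python
import math

def normal_density(x, sigma):
    """Ordinate of the normal curve for a deviation x, as in Eqs. (1) and (2)."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def bivariate_density_independent(x1, x2, sigma1, sigma2):
    """Height of the normal bivariate surface for independent variables, Eq. (5)."""
    exponent = -0.5 * (x1 ** 2 / sigma1 ** 2 + x2 ** 2 / sigma2 ** 2)
    return math.exp(exponent) / (2.0 * math.pi * sigma1 * sigma2)

# Illustrative values only (not taken from the text).
sigma1, sigma2 = 2.0, 1.0
x1, x2 = 1.5, -0.5

product = normal_density(x1, sigma1) * normal_density(x2, sigma2)
joint = bivariate_density_independent(x1, x2, sigma1, sigma2)
print(f"product of separate ordinates = {product:.6f}")
print(f"height from Eq. (5)           = {joint:.6f}")
```

The two printed values should agree to within floating-point rounding for any choice of deviations and standard deviations.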

If the two standard deviations are equal, the normal
probability surface is circular like Fig. 123. Horizontal planes
parallel to the base will intersect the figure in the form of circles
becoming smaller as the plane is elevated. Any vertical plane
parallel with the x1-axis (a line through the X̄1) will intersect the
figure in the form of a normal curve with a standard deviation
equal to σ1, and any vertical plane parallel with the x2-axis will
intersect the figure in the form of a normal curve with a standard
deviation equal to σ2. If the two standard deviations are equal,
these normal curves will be identical.

If, however, the two standard deviations are not equal, the
normal bivariate surface will be elliptical in form, as shown in
Fig. 125, rather than circular. Vertical planes drawn as before
will nevertheless intersect the surface in normal curves. The
vertical normal curves will have standard deviations equal to σ1,
and the horizontal normal curves will have standard deviations
equal to σ2. Horizontal planes parallel to the base in Fig. 125 will
intersect the figure in the form of ellipses, which will become smaller
as the plane is elevated. Figure 126 is the sort of ellipse that
would be obtained by the intersection of a plane horizontal to the
base plane of Fig. 125. The equation for the ellipse shown in
Fig. 126 is

\[
0.3x_1^2 + 6.7x_2^2 - 32 = 0
\]

\[
x_1 = \pm\sqrt{\frac{32 - 6.7x_2^2}{0.3}}
\]

Fig. 126. A horizontal section of the frequency surface of Fig. 125.

Pairs of x1 and x2 that satisfy this equation are:

      x2         x1
       0       ±10.3
    ±0.5       ±10.0
    ±1.0       ± 9.2
    ±1.5       ± 7.5
    ±2.0       ± 4.2
    ±2.18         0
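The pairs in the table can be recomputed directly from the equation of the ellipse. The sketch below assumes the rounded coefficients 0.3, 6.7, and 32 exactly as printed; the function name `x1_on_ellipse` is illustrative. Because the coefficients are themselves rounded, the computed values agree with the table only to about one unit in the first decimal place.

```python
import math

def x1_on_ellipse(x2, a=0.3, b=6.7, c=32.0):
    """Positive x1 satisfying a*x1^2 + b*x2^2 - c = 0, or None when x2 lies outside the ellipse."""
    radicand = (c - b * x2 * x2) / a
    if radicand < 0:
        return None
    return math.sqrt(radicand)

for x2 in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"x2 = ±{x2:<4}  x1 = ±{x1_on_ellipse(x2):.2f}")

# The ellipse meets the x2-axis where x1 = 0, that is, at x2 = ±sqrt(c / b).
print(f"x1 = 0 at x2 = ±{math.sqrt(32.0 / 6.7):.2f}")
```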
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Bivariate-surface, Correlated Variables. Instead of two inde¬ 
pendent variables, suppose there is a set of paired variables in 
which is displayed a marked tendency for positive correlation, so 
that large values of Xi are associated with large values of X 2 , 
and vice versa. This is the same as to say that positive values 



of x1 occur predominantly with positive values of x2 and negative
values of x1 occur predominantly with negative values of x2, the
small x's measuring in each case the deviations from respective
means. Assume that each distribution taken separately is a
symmetrical one like a in Fig. 127 and a in Fig. 128. In Fig.
127 let a represent the frequency curve of the total distribution
of the X2 variable. Then suppose this frequency distribution of
all the variants of the variable X2 is cross-classified into three
groups, (1) those X2's associated with large values of X1, (2)
those associated with the ordinary or average range of values of
X1, and (3) those associated with small values of X1.



The plane is accordingly divided vertically into three parts
representing the range of (1) large values of X1 (this part of the
plane is labeled β in Fig. 127); (2) ordinary or average range of
X1 values, represented in the figure by γ; and (3) small values of
X1, represented by δ in the figure.

By summarizing in a group those variates of X2 associated with
large values of X1 (those in the range of β in Fig. 127), and under
the assumption that large values of X2 are associated with large
values of X1, a frequency distribution like b, whose mean would
be larger than the X̄2 of the total population of variables X2,
would be obtained. The line AA' intersects the base of the
frequency curve b at its mean point.

By summarizing in a group the X2 variables in the γ range of
Fig. 127, a frequency distribution of X2 variables like c would be
obtained; then the one showing the X2 variables associated with
X1 in the range of δ would give a frequency distribution like d.
The line AA' in Fig. 127 also passes through the mean of the
frequency curve d. In other words, the means of curves b, c, and
d all lie on the same straight line, AA'.

The X1 variable is treated in a similar manner in Fig. 128, in
which a represents the frequency curve of all of the values of the
X1 variable. This frequency distribution of all the X1 variables
is then cross-classified into three groups, (1) those associated
with small values of X2, (2) those associated with ordinary or
average range of values of X2, and (3) those associated with large
values of X2. The plane of Fig. 128 is accordingly divided
horizontally into three parts, representing the range of (1) small
values of X2 (this part of the plane is labeled β in Fig. 128); (2)
ordinary or average range of X2 values, represented in the figure
by γ; and (3) large values of X2, represented by δ in the figure.
By summarizing in one frequency distribution the variates of X1
associated with small values of X2 (those in the range of β in
Fig. 128), under the assumption that small values of X1 are
associated with small values of X2, a frequency distribution like
b, whose mean is smaller than the mean of the total population
of variable X1, would be obtained.

By summarizing in one group the X1 variables in the range γ
of the X2 variable, a frequency distribution of X1 variables like
c would be obtained; the group of X1 variables associated with
X2 in the range of δ will give a frequency distribution like d.
The line passing through the means of these three frequency
distributions would be like BB' in Fig. 128.

Normal Correlation Surface, Correlated Variables. A bivariate
frequency distribution showing the joint variation of two
correlated variables would thus appear to be represented by a
frequency surface that is turned so as to make an angle with the
x1- and x2-axes. A picture of a normal bivariate frequency
surface for correlated variables is shown in Fig. 129. Figures 127
and 128 constitute analyses of the frequencies of Fig. 129 that
divide the surface into three parts, first up and down and second
left and right. The three figures, therefore, are an attempt to
view the same distribution in three different ways. If any cross
section is taken of the surface represented by Fig. 129, parallel to
the x1-axis, the cross section will have the form of a normal
frequency curve with its mean on the line bb'. Any cross section
of this surface taken parallel to the x2-axis will have the form of
a normal frequency curve with its mean on the line aa'. Such
cross sections are similar in character to the frequency curves
b, c, and d, discussed in connection with Figs. 127 and 128,
respectively. Typical cross sections are likewise shown in
Fig. 129.

Fig. 129. A normal bivariate frequency surface, correlated variables.

Careful study of Figs. 127 to 129 will aid greatly in the
understanding of the theory of correlation. They serve also as the
basis for comprehending the theoretical explanation in the ensuing
section.

Derivation of Equation for Bivariate Normal Frequency
Distribution, Correlated Variables. Equation of a Rotated Ellipse.
A quadratic equation of the general form

\[
aX_1^2 + 2hX_1X_2 + bX_2^2 + 2gX_1 + 2fX_2 + c = 0 \tag{6}
\]

is an ellipse under the following conditions:¹

\[
ab - h^2 > 0 \quad\text{and}\quad D \neq 0
\]

where

\[
D = \begin{vmatrix} a & h & g \\ h & b & f \\ g & f & c \end{vmatrix}
  = abc + 2fgh - af^2 - bg^2 - ch^2
\]

For example, the equation

\[
X_1^2 - 4X_1X_2 + 6X_2^2 - 24X_1 + 64X_2 + 144 = 0 \tag{6'}
\]

is an ellipse like that shown in Fig. 130, expressed with reference
to the large X1X2-axes. The center of the ellipse is at X1 = 4,
X2 = −4.² The equation for an ellipse with reference to the axes
passing through its center is³

\[
a'x_1^2 + 2h'x_1x_2 + b'x_2^2 + c' = 0 \tag{7}
\]

where a' = a, h' = h, b' = b, and c' = D/(ab − h²).

Fig. 130. A horizontal cross section of a normal bivariate surface,
correlated variables.

For Eq. (6') the new form is

\[
x_1^2 - 4x_1x_2 + 6x_2^2 - 32 = 0 \tag{7'}
\]

1 Fine, H. B., and H. D. Thompson, Coordinate Geometry, pp. 137-138.

2 The center of the ellipse is found by solving the following two equations
for X1 and X2:

\[
aX_1 + hX_2 + g = 0
\]
\[
hX_1 + bX_2 + f = 0
\]

In this problem, a = 1, h = −2, g = −12, b = 6, and f = 32.

3 Cf. Fine and Thompson, op. cit.

Following are the solutions for Eqs. (6') and (7'), from which
Fig. 130 was drawn, the two equations describing the same
ellipse:

Equation (6')
Solution: X1 = 2X2 + 12 ± √(−2(X2² + 8X2))

     X2       X1
      0       12 ± √0  = 12
     −1       10 ± √14 = 13.74,  6.3
     −2        8 ± √24 = 12.9,   3.1
     −3        6 ± √30 = 11.5,   0.5
     −4        4 ± √32 =  9.7,  −1.7
     −5        2 ± √30 =  7.5,  −3.5

Equation (7')
Solution: x1 = 2x2 ± √(32 − 2x2²)

     x2       x1
      0        0 ± √32  = 5.7,  −5.7
     ±1       ±2 ± √30  = ±7.5, ∓3.5
     ±2       ±4 ± √24  = ±8.9, ∓0.9
     ±3       ±6 ± √14  = ±9.7, ±2.3
     ±3.5     ±7 ± √7.5 = ±9.7, ±4.3
     ±4       ±8 ± √0   = ±8

The equations of the axes of the ellipse are obtained by finding
the positive root of λ in the following equation:

\[
h'\lambda^2 + (a' - b')\lambda - h' = 0
\]

or, in this case,

\[
-2\lambda^2 - 5\lambda + 2 = 0
\]

\[
\lambda = 0.351
\]

the other root being −2.85. The equation for the major axis of the
ellipse is therefore x2 = 0.351x1, that is, x1 = 2.85x2, and the
equation for the minor axis is x2 = −2.85x1.

Referred to its own major and minor axes, the equation of the
ellipse is Ax1² + Bx2² + C = 0, where A and B are obtained from

\[
A + B = a' + b' \qquad AB = a'b' - h'^2 \qquad C = c'
\]

and the condition that A − B has the same sign as h'. For this
ellipse it is thus found that A = 0.3 and B = 6.7. The equation
for this ellipse referred to its own axes (see Fig. 126) is

\[
0.3x_1^2 + 6.7x_2^2 - 32 = 0
\]
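The arithmetic of this example can be retraced step by step. The following sketch uses only the coefficients of Eq. (6'); the variable names are illustrative, and solving the two center equations by Cramer's rule is a choice made here for brevity, not a method stated in the text.

```python
import math

# Coefficients of Eq. (6') in the notation of Eq. (6).
a, h, b, g, f, c = 1.0, -2.0, 6.0, -12.0, 32.0, 144.0

disc = a * b - h * h                                   # ab - h^2, must exceed 0
D = a * b * c + 2 * f * g * h - a * f * f - b * g * g - c * h * h

# Center: solve aX1 + hX2 + g = 0 and hX1 + bX2 + f = 0 (Cramer's rule).
center_x1 = (h * f - b * g) / disc
center_x2 = (h * g - a * f) / disc

c_prime = D / disc                                     # constant term of Eq. (7)

# A and B are the two roots of t^2 - (a + b) t + (ab - h^2) = 0,
# assigned so that A - B has the same sign as h (h is nonzero here).
root = math.sqrt((a + b) ** 2 - 4.0 * disc)
small, large = (a + b - root) / 2.0, (a + b + root) / 2.0
A, B = (small, large) if h < 0 else (large, small)

# Slopes lambda (x2 = lambda * x1) of the axes: roots of h*l^2 + (a - b)*l - h = 0.
axis_root = math.sqrt((a - b) ** 2 + 4.0 * h * h)
lam_1 = (-(a - b) + axis_root) / (2.0 * h)
lam_2 = (-(a - b) - axis_root) / (2.0 * h)
lam_positive = max(lam_1, lam_2)

print(f"ab - h^2 = {disc}, D = {D}")
print(f"center: X1 = {center_x1}, X2 = {center_x2}")
print(f"c' = {c_prime}")
print(f"A = {A:.3f}, B = {B:.3f}")
print(f"axis slopes: {lam_1:.3f} and {lam_2:.3f}; 1 / positive root = {1.0 / lam_positive:.2f}")
```

The reciprocal of the positive root is the coefficient 2.85 that appears when the major axis is written in the form x1 = 2.85x2.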

Mathematical Representation of a Bivariate Normal Correlation
Surface. It was noted above that the bivariate normal surface
in which x1 and x2 are independent of each other (that is, in
which no correlation exists between them) is of the form

\[
dP(x_1 x_2) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{8}
\]

The constant term, 1/(2πσ1σ2), is a constant dependent on the
values of the two standard deviations in any particular instance.
The product of this constant times the term

\[
e^{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2}\right)}
\]

gives, for various values of x1 and x2, the height of the bivariate
surface from the base (the distance OP in Fig. 125). If a horizontal
plane parallel with the base plane is drawn through the normal
bivariate surface at a distance OP from the base plane, the
intersection of the plane and the bivariate surface will be an ellipse
(as in Fig. 126) if the standard deviations are unequal; the
intersection will be a circle (as in Fig. 123) if the two standard
deviations are equal. Such a plane represents the locus of all points
distant OP from the base plane, and the passing of such a horizontal
plane through the bivariate surface is equivalent to setting the
exponential term above equal to a constant, which is equivalent to
putting

\[
\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} = c
\]

This equation represents a circle if σ1 = σ2 and an ellipse if
σ1 ≠ σ2.

The smaller the constant c, the smaller will be the circle or
ellipse until, at the peak of the bivariate surface, a very small
circle or ellipse will be found; finally, just a point.

If the two variables are correlated, two changes occur. (1)
The ellipse is rotated. (2) The ellipse is narrowed. If before
correlation the surface is circular in form, owing to the fact that
the standard deviations are equal, the existence of correlation
will cause the circle to be converted into a rotated ellipse,
narrowing the circle to an elliptical form. If before correlation the
surface is elliptical in form, owing to the fact that the standard
deviations are unequal (see Fig. 126), the existence of correlation
will cause the ellipse to rotate and also to become narrower.
This phenomenon is explained as follows:

If larger than average values of X1 cause X2 to be larger than
average and smaller than average values of X1 cause X2 to be
smaller than average, the pull exerted on X2 values is indicated
by the arrows in Fig. 131. The larger the X1, the more pull
will be exercised upon X2 to make it larger than its average.
This is indicated by making arrow (1) longer than arrows (2),
(3), and (4), which, respectively, represent the degree to which
successively smaller values of X1 affect values of X2 until, by
the time X1 becomes smaller than average (below the line X̄1),
arrow (4') points to the negative pull, that is, causing X2 to be
less than its average.

When correlation exists, this means that bivariate frequencies
located in quadrant II, where X1 is larger and X2 is smaller
than average, tend to move over to quadrant I, where X2 and
X1 are both larger than average. Bivariate frequencies already
located in quadrant I are less affected. Similarly, bivariate
frequencies in quadrant IV tend to move to quadrant III, where
both X2 and X1 are smaller than average, while bivariate
frequencies in quadrant III are less affected. The result is that
the rotated ellipse becomes narrowed as shown in the part of
Fig. 131 at the right. Any horizontal plane parallel to the base
of a correlated bivariate surface (Fig. 129) will intersect the
bivariate frequency surface in the form of an ellipse such as that
shown in the right half of Fig. 131: large ellipses near the base
plane, and smaller and smaller ellipses as the horizontal plane is
raised higher and higher from the base. These ellipses have the
equation

\[
ax_1^2 + 2hx_1x_2 + bx_2^2 + c = 0
\]

Fig. 131. Illustrating the difference between the nonexistence and
existence of correlation in a normal bivariate frequency surface.

As already noted, the middle term 2hx1x2 is present in the
equation because of the fact that the ellipse is rotated and now
described in terms of axes other than its own, although the origin
remains the center of the ellipse. The middle term is thus
present because of correlation, which causes the rotation of the
ellipse. This middle term is generally called the "product term"
because it is the product of the two variables. When there is
no correlation, this middle term disappears.¹ The narrowing of
the ellipse, as will be seen, results in the increase in the value of
the constant term 1/(2πσ1σ2).

Since the normal bivariate surface in which x1 and x2 are
correlated is thus elliptical in form but rotated and narrower than
the elliptical surface representing uncorrelated bivariates, the
distribution of probabilities or relative frequencies will be given
by an expression of the form

\[
dP(x_1 x_2) = k\, e^{-(ax_1^2 + 2hx_1x_2 + bx_2^2)}\, dx_1\, dx_2 \tag{9}
\]

This is the general formula for a normal bivariate frequency
distribution of correlated variables. The remainder of the
argument, which appears in the Appendix to this chapter, shows
how the parameters k, a, h, and b may be evaluated in terms of
the moments of x1 and x2. When the proper values of the
parameters are inserted, the formula is as follows:²

\[
dP(x_1 x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\, e^{-\frac{1}{2(1-r^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{10}
\]

1 See p. 475.

2 See Appendix, pp. 492-496.

This probability expression describes a normal bivariate
frequency distribution such as that graphed in Fig. 129. The
rotated position is reflected in the fact that the exponent of e
has a middle "product term." The fact that the surface is
narrower than it would be if there were no correlation is reflected
in the character of the constant term, which is larger than the
constant term of a normal bivariate frequency surface of
uncorrelated variables.¹ In other words, because r cannot be greater
than 1,

\[
\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}} \geq \frac{1}{2\pi\sigma_1\sigma_2}
\]

The degree to which the constant term in the correlated surface
is larger depends upon the value of r. If r = 0, the constant
term becomes identical with the constant term of the
uncorrelated surface. If r = 1, the constant term of the correlated
surface becomes infinitely large, reflecting the fact that when
r = 1 the surface becomes so narrowed that it is a plane, all
points being on the line of regression.
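The growth of the constant term with r can be tabulated numerically. The sketch below assumes the standard form of the normal correlation surface, with constant term 1/(2πσ1σ2√(1 − r²)) and a product term in the exponent; the function name and the values of σ1 and σ2 are illustrative assumptions.

```python
import math

def bivariate_density(x1, x2, sigma1, sigma2, r):
    """Height of the normal correlation surface for deviations x1, x2 and |r| < 1."""
    one_minus_r2 = 1.0 - r * r
    u, v = x1 / sigma1, x2 / sigma2
    quadratic = u * u - 2.0 * r * u * v + v * v        # includes the product term
    constant = 1.0 / (2.0 * math.pi * sigma1 * sigma2 * math.sqrt(one_minus_r2))
    return constant * math.exp(-quadratic / (2.0 * one_minus_r2))

# Illustrative standard deviations (not taken from the text).
sigma1, sigma2 = 2.0, 1.0
uncorrelated_peak = bivariate_density(0.0, 0.0, sigma1, sigma2, 0.0)

for r in (0.0, 0.5, 0.9, 0.99):
    peak = bivariate_density(0.0, 0.0, sigma1, sigma2, r)
    print(f"r = {r:<4}  peak height = {peak:.4f}  ratio to r = 0: {peak / uncorrelated_peak:.3f}")
```

The ratio printed in the last column is 1/√(1 − r²), which equals 1 when r = 0 and increases without limit as r approaches 1.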

LINES OF REGRESSION

In the discussion of Fig. 127 it was pointed out that the line
AA' passes through the means of frequency distributions b, c,
and d. Similarly, in the discussion of Fig. 128, it was said that
the line BB' passes through the means of frequency distributions
b, c, and d. In the discussion of Fig. 129 the line aa' was said to
pass through the means of any frequency distribution made by a
vertical plane parallel with the x1-axis, and the line bb' was said
to pass through the means of any frequency distribution made
by a vertical plane parallel with the x2-axis. These two lines
are thus the progressions of the means for the normal bivariate
surface. As will be shown shortly, they are also the least-squares
lines that might be fitted to the surface. In both senses,
therefore, they are the lines of regression for the surface.

If there is no correlation, as illustrated by Figs. 122, 123, and
126, the two lines of regression correspond with the major and
minor axes of the ellipse, that is, with the axes represented by
the X̄1 and X̄2 lines of Fig. 122 or Fig. 126. By hypothesis, in
the uncorrelated bivariate surface the mean of any frequency
distribution made by a vertical plane parallel to the x1-axis will
be on the X̄1 line, and the mean of any frequency distribution
made by a vertical plane parallel to the x2-axis will be on the X̄2
line. When the surface is rotated and narrowed, as a result of
correlation, it is part of the hypothesis that the normal symmetry
of the surface remains and accordingly the means remain in a
straight line, but a straight line at an acute angle rather than
perpendicular to the original axis.

1 The narrowing is due to a stretching upward of a given volume. As
indicated, in the limiting situation (r = 1), the surface becomes a vertical
plane stretching to an infinite height and having an infinitesimal thickness.

Mathematical Representation of Lines of Regression. The
bivariate normal correlation surface in terms of probabilities
has been found to be described as follows:

\[
dP(x_1 x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\, e^{-\frac{1}{2(1-r^2)}\left(\frac{x_1^2}{\sigma_1^2} - \frac{2rx_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\, dx_1\, dx_2 \tag{11}
\]

A line of regression, for example, the line of regression of x2 on
x1, is a general description of the law of relationship by which
for a given value of x1 the most probable value of x2 may be
determined. Equation (11) describes the joint probability of any
bivariate x1x2. The probability of any value of x2 occurring
with some specified value of x1, say x1', will be as follows:

\[
dP(x_1' x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\, e^{-\frac{1}{2(1-r^2)}\left[\left(\frac{x_1'}{\sigma_1}\right)^2 - \frac{2rx_1'x_2}{\sigma_1\sigma_2} + \left(\frac{x_2}{\sigma_2}\right)^2\right]}\, dx_1\, dx_2 \tag{12}
\]

If (x1'/σ1)² is factored from the exponent of e, the equation
becomes

\[
dP(x_1' x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\, e^{-\frac{(x_1')^2}{2\sigma_1^2(1-r^2)}}\, e^{-\frac{1}{2(1-r^2)}\left(\frac{x_2^2}{\sigma_2^2} - \frac{2rx_1'x_2}{\sigma_1\sigma_2}\right)}\, dx_1\, dx_2 \tag{13}
\]

The square of

\[
\frac{x_2^2}{\sigma_2^2} - \frac{2rx_1'x_2}{\sigma_1\sigma_2}
\]

is completed by adding

\[
\frac{r^2 (x_1')^2}{\sigma_1^2}
\]

which must also be subtracted to keep the value of the whole
expression unchanged. This subtracted part may be conveniently
put with the other (x1')² term so that the final result of
these operations is as follows:

\[
dP(x_1' x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}}\, e^{-\frac{(x_1')^2 - r^2 (x_1')^2}{2\sigma_1^2(1-r^2)}}\, e^{-\frac{1}{2(1-r^2)}\left(\frac{x_2}{\sigma_2} - \frac{rx_1'}{\sigma_1}\right)^2}\, dx_1\, dx_2 \tag{14}
\]

Upon simplifying the exponents and splitting up the constant
term and the dx1 dx2, this expression becomes

\[
dP(x_1' x_2) = \left[\frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{(x_1')^2}{2\sigma_1^2}}\, dx_1\right] \left[\frac{1}{\sigma_2\sqrt{1-r^2}\,\sqrt{2\pi}}\, e^{-\frac{\left(x_2 - r\frac{\sigma_2}{\sigma_1}x_1'\right)^2}{2\sigma_2^2(1-r^2)}}\, dx_2\right] \tag{15}
\]

Since the first factor is a constant (x1' being given), Eq. (15)
shows that the probability of an x2 for a given value of x1 is
proportional to the probability of a normally distributed variate
whose mean is r(σ2/σ1)x1' and whose standard deviation is
σ2√(1 − r²). (It will be recalled that the general equation for the
normal curve is

\[
\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}\, dx
\]

.) Accordingly, the most probable value of x2 for specified values
of x1, that is, the line of regression of x2 on x1, is as follows:

\[
x_2 = r\,\frac{\sigma_2}{\sigma_1}\, x_1
\]

The standard deviation or scatter about this line is σ2√(1 − r²).

From Eq. (15) it is seen that the locus of all points representing
the means of x2 for a given x1 is x2 = r(σ2/σ1)x1, which is the
equation of the line of regression of x2 on x1. The line of regression
of x1 on x2 is given by interchanging x1 and x2 in the above
argument. As indicated above, these two lines are the same as
those that might be fitted to the distribution by the method of
least squares. From Eq. (15) it is also shown that the standard
deviation of x2 for a given x1 (in other words, the scatter at any
point of the line of regression) is independent of the selected
value of x1, for it is always equal to σ2√(1 − r²).
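The two results of this argument, a regression slope of r(σ2/σ1) and a scatter of σ2√(1 − r²) about the line, can be checked against a simulated sample. The following is a sketch only: the pairs are generated by mixing two independent standard normal variates, which is one convenient way (assumed here, not taken from the text) of producing normally distributed deviations with a chosen r, and all names and parameter values are illustrative.

```python
import math
import random

def simulate_correlated_pairs(n, sigma1, sigma2, r, seed=0):
    """Draw n pairs of deviations (x1, x2) with s.d. sigma1, sigma2 and correlation r."""
    rng = random.Random(seed)
    root = math.sqrt(1.0 - r * r)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Mixing z1 into x2 yields correlation r while keeping its s.d. at sigma2.
        pairs.append((sigma1 * z1, sigma2 * (r * z1 + root * z2)))
    return pairs

def regression_of_x2_on_x1(pairs):
    """Least-squares slope of x2 on x1 and the scatter (s.d. of residuals) about the line."""
    n = len(pairs)
    mean1 = sum(p[0] for p in pairs) / n
    mean2 = sum(p[1] for p in pairs) / n
    sum_products = sum((a - mean1) * (b - mean2) for a, b in pairs)
    sum_squares = sum((a - mean1) ** 2 for a, _ in pairs)
    slope = sum_products / sum_squares
    residuals = [(b - mean2) - slope * (a - mean1) for a, b in pairs]
    scatter = math.sqrt(sum(e * e for e in residuals) / n)
    return slope, scatter

sigma1, sigma2, r = 2.0, 1.0, 0.8
pairs = simulate_correlated_pairs(50000, sigma1, sigma2, r)
slope, scatter = regression_of_x2_on_x1(pairs)

print(f"fitted slope   = {slope:.3f}   r*sigma2/sigma1     = {r * sigma2 / sigma1:.3f}")
print(f"fitted scatter = {scatter:.3f}   sigma2*sqrt(1-r^2) = {sigma2 * math.sqrt(1 - r * r):.3f}")
```

For these assumed values the theoretical slope is 0.4 and the theoretical scatter is 0.6; the fitted values from a large sample should fall close to both.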

NORMAL MULTIVARIATE FREQUENCY “SURFACE”

When a bivariate distribution is described in geometrical
terms, one of the dimensions can be used to measure the
frequencies. This is not possible for distributions involving more
than two variables. In the three-variable case, for example, all
three dimensions must be used to indicate the variations in the
variables themselves, and none is left to measure the frequencies.

Resort is had in multivariate problems to the use of densities
to measure frequencies. Such a device could have been used in
the monovariate or bivariate case; instead of having the
frequency of any interval represented by the height of a rectangle
erected on the interval, it could be assumed that the cases were
represented by points on a line, and the more points crowded into
any given interval on the line, i.e., the greater the density of
points in the interval, the greater would be the frequency of
that interval. Likewise, in the bivariate case, instead of
representing the frequency of cases in any given two-dimensional cell
by the height of a rectangular pile of checkers set up on that cell,
it would be possible to look upon the various cases as points in
the two-dimensional plane; the frequency of points in any cell
would then become the density of points in that cell.

This is the device used to measure frequencies in the
multivariate case. For a trivariate distribution, for example, the
various cases are looked upon as points in three-dimensional
space, and the density of these points in any given
three-dimensional cell becomes the measure of the relative frequency of
cases in that cell. A trivariate frequency “surface,” if it may be so
called, is in reality a trivariate density function. The same idea
may be carried over by analogy to distributions of four or more
variables, although no graphical representation can actually be
made of such distributions.

The properties of a normal multivariate “surface” or density
function are merely generalizations of the properties of a normal
bivariate surface. Whereas in the latter case, loci of
equiprobability (i.e., loci of constant level on the frequency surface)
were ellipses in the x1x2-plane, in the multivariate case loci of
equiprobability (i.e., loci of equal density in the N-dimensional
space) are ellipsoids in the X1, X2, . . . , Xn space. A picture
of a three-dimensional ellipsoid is given in Fig. 132. This
represents a contour of equiprobability for a trivariate normal
distribution in which there is no correlation. Similar ellipsoids, some
larger, some smaller, would represent other contours of
equiprobability, and the whole distribution could be represented by a
nest of such ellipsoids. The elliptical contours representing a
high degree of probability are, of course, the contours close to
the center, the center itself being the point of maximum
probability (maximum density). As one goes off from the center
in a straight line in any direction whatsoever, the change in
probability (density) is in accordance with the normal law. If
the variables are measured in standard-deviation units, the
ellipsoids become spheres and the distribution becomes
symmetrical in all directions.

When there is correlation between the variables, the ellipsoids
of equiprobability become tilted with respect to the various axes
and flattened out. If the variables are measured in
standard-deviation units, the degree of tilting in any direction is directly
related to the amount of the correlation between the variables
concerned. The greater the multiple correlation between the
variables, the narrower or flatter the ellipsoids become. In
the limit in which there is perfect correlation between all the
variables, the whole distribution reduces to a line through
the origin at an angle of approximately 54¾ degrees (cos⁻¹ 1/√3)
with all the axes (assuming the variables are measured in σ units).

Fig. 132. Ellipsoid of equiprobability for a trivariate normal frequency
distribution.
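The angle quoted for the trivariate case follows from the direction cosines of the line x1 = x2 = x3, each of which is 1/√3. A minimal sketch of the computation is given below; the function name is illustrative, and extending the formula to n variables is an assumption made here by analogy with the three-variable case.

```python
import math

def angle_with_axes_degrees(n_variables):
    """Angle between the line x1 = x2 = ... = xn and each coordinate axis, in degrees."""
    # Each direction cosine of the equiangular line is 1 / sqrt(n).
    return math.degrees(math.acos(1.0 / math.sqrt(n_variables)))

print(f"two variables:   {angle_with_axes_degrees(2):.2f} degrees")
print(f"three variables: {angle_with_axes_degrees(3):.2f} degrees")
```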

As in the simpler case, a plane or hyperplane of regression
represents the locus of the mean values of one variable for various
combinations of the other variables. For a normal multivariate
distribution, the deviations from any plane of regression are all
normally distributed with a constant standard deviation for any
one plane.

All the properties of a normal bivariate distribution thus carry
over to a normal multivariate distribution, the only difference
being that ellipses of equiprobability and lines of regression now
become ellipsoids and hyperplanes of higher dimensions. The
character of the distribution is essentially the same.

NONNORMAL BIVARIATES AND MULTIVARIATES 

If a bivariate or multivariate distribution does not approach 
the normal form, much of the conventional correlation analysis 
loses its significance. In some cases, by taking logarithms or 
reciprocals a nonnormal distribution may be transformed into a 
normal distribution.^ In some instances, a multivariate dis¬ 
tribution may be normal vnth respect to its variations about the 
means of the rows and columns but the means of the rows or 
means of columns may trace out a curve of regression. In other 
instances, the regressions of the means may be linear, or planar, 
but the deviations around these lines, or planes, of regression 
may be either nonnormally distributed or normally distributed 
with varying standard deviations. 

If, in the case of two variables, the regressions are linear, the
initial arguments presented for the use of the product-moment
formula for $r$ are still valid even for nonnormal distributions.²
Large values of $X_1$ would still tend in general to be associated
with large values of $X_2$ (or with small values if the correlation is
negative), and a formula based upon the product deviations
would give a good measure of the association between the two
variables. If the distribution of cases around the lines of
regression is skewed, however, or if the standard deviation varies from
one part of the line to another, the scatter about the lines of
regression loses its significance as a measure of typical variability.
Great care must be taken in these cases in using an average
scatter to determine the degree of error in an estimate based on
the line of regression. When the distributions are not normal,
the rule that two-thirds of the cases tend to lie between plus and
minus $\sigma_{1.2}$ no longer holds.
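How far the two-thirds rule can fail may be seen by comparing a normal distribution with a skewed one. The sketch below is an illustration only: the exponential distribution is an assumption chosen here as a convenient skewed case (the text names no particular distribution), and both function names are hypothetical. Each function returns the share of cases lying within plus and minus $k$ standard deviations of the mean.

```python
import math

def normal_share_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

def exponential_share_within(k):
    """Same share for an exponential distribution with mean 1 (its standard
    deviation is also 1). Assumed here purely as an example of strong skewness."""
    lower = max(0.0, 1.0 - k)   # the variable cannot fall below zero
    upper = 1.0 + k
    return math.exp(-lower) - math.exp(-upper)

print(round(normal_share_within(1.0), 4))       # about two-thirds
print(round(exponential_share_within(1.0), 4))  # noticeably more than two-thirds
```

Under these assumptions the skewed case puts well over four-fifths of the cases inside one standard deviation, so an error band read off the normal rule would be misleading.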

Finally, if the bivariate distribution is not normal, even the
product-moment formula may cease to be a statistic of special
significance in characterizing the distribution. In the normal
case, if the two means, the two standard deviations, and $r$ are
all known, the bivariate distribution is fully determined. In the
nonnormal case, other measures similar to measures of skewness
and kurtosis in the monovariate case may be of equal if not
greater importance in describing the bivariate distribution.
These considerations should always be borne in mind when $r$ is
used to measure correlation between nonnormally distributed
bivariates.

¹ See Chap. XV, pp. 377-396.

² See Chap. XIII, pp. 338-353.

Similar statements may also be made about nonnormal multivariate
distributions. Here the higher dimensionality multiplies
the possibilities of skewness, kurtosis, and other departures from
normality.¹


APPENDIX 


DERIVATION OF THE EQUATION FOR THE NORMAL BIVARIATE 
FREQUENCY SURFACE, CORRELATED VARIABLES. 

The normal bivariate surface in which $X_1$ and $X_2$ are correlated is
elliptical in form but rotated and narrower than the elliptical surface
representing uncorrelated bivariates. The distribution of the probabilities or
relative frequencies is given by an expression of the following form:

$$dP(x_1x_2) = k\,e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)}\,dx_1\,dx_2 \qquad (16)$$

in which the constants $k$, $a'$, $h'$, and $b'$ may be evaluated in terms of the
moments of $X_1$ and $X_2$.

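First it is to be noted that

$$\iint P(x_1x_2)\,dx_1\,dx_2 = 1 \qquad \text{(i)}$$

$$\iint P(x_1x_2)\,x_1\,dx_1\,dx_2 = 0 \qquad \text{(ii)}$$

$$\iint P(x_1x_2)\,x_2\,dx_1\,dx_2 = 0 \qquad \text{(iii)}$$

$$\iint P(x_1x_2)\,x_1^2\,dx_1\,dx_2 = \sigma_1^2 \qquad \text{(iv)}$$

$$\iint P(x_1x_2)\,x_2^2\,dx_1\,dx_2 = \sigma_2^2 \qquad \text{(v)}$$

$$\iint P(x_1x_2)\,x_1x_2\,dx_1\,dx_2 = r\sigma_1\sigma_2 \qquad \text{(vi)}$$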


Equation (i) is true since the sum of all probabilities or relative frequencies
is necessarily one. Equations (ii) and (iii) are true because $x_1$ and $x_2$
represent deviations from the means of $X_1$ and $X_2$. Thus
$\iint P(x_1x_2)\,x_1\,dx_1\,dx_2$ is equivalent to $\frac{\Sigma f x_1}{N}$,
which equals zero. Likewise,

$$\iint P(x_1x_2)\,x_2\,dx_1\,dx_2 = \frac{\Sigma f x_2}{N} = 0$$

Equation (iv) is another form of $\frac{\Sigma f x_1^2}{N}$, which is equal to
the variance of $X_1$; Eq. (v) is equivalent to $\frac{\Sigma f x_2^2}{N}$,
which is equal to the variance of $X_2$; and Eq. (vi) is equivalent to
$\frac{\Sigma f x_1x_2}{N}$, which is equal to $r\sigma_1\sigma_2$, since
$r = \Sigma f x_1x_2 / N\sigma_1\sigma_2$.

1 For a more complete consideration of the problem of nonnormality, 
see Smith and Duncan, Sampling Statistics, Chap. 18. 
Second it is to be noted that, with reference to its rotated axes, $aa'$ and
$bb'$, the equation of the ellipse representing the intersection of the
frequency surface by a horizontal plane at a distance $ke^{-C/2}$ from the base
plane is as follows:

$$Ax_1'^2 + Bx_2'^2 = C$$

where $x_1'$ and $x_2'$ represent the coordinates of a point with reference to the
axes $aa'$ and $bb'$, that is to say, $x_1'$ measures the perpendicular distance of a
point from $aa'$ and $x_2'$ measures the perpendicular distance of a point from
$bb'$. If the areal element¹ $dx_1\,dx_2$ is also expressed in terms of the $x_1'x_2'$
coordinates, it becomes $dx_1\,dx_2 = dx_1'\,dx_2'$.* The whole probability function
thus becomes

$$dP(x_1'x_2') = k\,e^{-\frac{1}{2}(Ax_1'^2 + Bx_2'^2)}\,dx_1'\,dx_2' \qquad (17)$$

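But this is the form of a normal frequency surface for uncorrelated variables,
so that, as seen above, pages 474-475,

$$A = \frac{1}{\sigma_1'^2}, \qquad B = \frac{1}{\sigma_2'^2}, \qquad k = \frac{1}{2\pi\sigma_1'\sigma_2'}$$

since there is no cross-product term, $H = 0$.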


¹ It will be recalled that $dP(x_1x_2) = F(x_1x_2)\,dx_1\,dx_2$ is represented
geometrically by the volume under the surface $F(x_1x_2)$ cut off by a hollow pipe,
erected on a rectangle in the $X_1X_2$ plane, the sides of which are $dx_1$ and $dx_2$
(see Fig. 125, p. 475). To express the whole probability distribution in
terms of the new $x_1'x_2'$ coordinates, the area of the pipe's base, $dx_1\,dx_2$, must
be transformed into these new coordinates as well as the height of the pipe,
$F(x_1x_2)$.

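* The transformation of coordinates is of the form

$$x_1 = x_2'\sin\alpha + x_1'\cos\alpha$$
$$x_2 = x_2'\cos\alpha - x_1'\sin\alpha$$

where $\alpha$ is the angle that $aa'$ makes with the $X_2$-axis. Cf. Fine and
Thompson, Coordinate Geometry, p. 120. Since, in general, $dx_1\,dx_2$ equals, within
differentials of higher order,

$$\begin{vmatrix} \dfrac{\partial x_1}{\partial x_1'} & \dfrac{\partial x_1}{\partial x_2'} \\[2ex] \dfrac{\partial x_2}{\partial x_1'} & \dfrac{\partial x_2}{\partial x_2'} \end{vmatrix}\,dx_1'\,dx_2'$$

it follows that

$$dx_1\,dx_2 = \begin{vmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{vmatrix}\,dx_1'\,dx_2' = dx_1'\,dx_2'$$

since $\cos^2\alpha + \sin^2\alpha = 1$. Cf. Wilson, E. B., Advanced Calculus, pp.
133-134.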
The distribution function, Eq. (17), may therefore be written as follows:

$$dP(x_1'x_2') = \frac{1}{2\pi\sigma_1'\sigma_2'}\,e^{-\frac{1}{2}\left[\left(\frac{x_1'}{\sigma_1'}\right)^2 + \left(\frac{x_2'}{\sigma_2'}\right)^2\right]}\,dx_1'\,dx_2' \qquad (18)$$

where $\sigma_1'$ and $\sigma_2'$ are the standard deviations of the new variables
$x_1'$ and $x_2'$. It will be noted that this transformation has not changed the
probability of a given $X_1X_2$ combination but has merely expressed it in terms of
a new set of coordinates. Accordingly, $P(x_1x_2) = P(x_1'x_2')$, where $x_1'$ and
$x_2'$ are derived by a linear transformation from $x_1$ and $x_2$.*

Finally, it will be noted that in any equation of the second degree the
product of the coefficients of the squared terms minus the square of one-half
the coefficient of the cross-product term is invariant (that is, its value
remains unchanged) under simple translations and rotations.¹ Accordingly,
the following relationships hold:

$$AB - H^2 = a'b' - h'^2$$

or since $H = 0$,

$$AB = a'b' - h'^2$$
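The invariance just cited can be verified numerically. The following is a minimal sketch, not part of the original derivation: the function names and the sample coefficients are assumptions chosen for illustration. It substitutes the rotation $x_1 = x_1'\cos\alpha + x_2'\sin\alpha$, $x_2 = x_2'\cos\alpha - x_1'\sin\alpha$ into $a x_1^2 + 2h x_1x_2 + b x_2^2$ and compares $AB - H^2$ with $ab - h^2$.

```python
import math

def rotate_quadratic(a, h, b, alpha):
    """Coefficients (A, H, B) of A*x1'^2 + 2*H*x1'*x2' + B*x2'^2 obtained by
    substituting x1 = x1'*cos(alpha) + x2'*sin(alpha) and
    x2 = x2'*cos(alpha) - x1'*sin(alpha) into a*x1^2 + 2*h*x1*x2 + b*x2^2."""
    c, s = math.cos(alpha), math.sin(alpha)
    A = a * c * c - 2.0 * h * s * c + b * s * s
    H = (a - b) * s * c + h * (c * c - s * s)
    B = a * s * s + 2.0 * h * s * c + b * c * c
    return A, H, B

def principal_angle(a, h, b):
    """Rotation angle at which the cross-product coefficient H vanishes."""
    return 0.5 * math.atan2(2.0 * h, b - a)

a, h, b = 2.0, 0.6, 1.0          # illustrative coefficients only
for alpha in (0.3, 1.1, 2.5):
    A, H, B = rotate_quadratic(a, h, b, alpha)
    print(round(A * B - H * H, 12), round(a * b - h * h, 12))

A, H, B = rotate_quadratic(a, h, b, principal_angle(a, h, b))
print(abs(H) < 1e-12)            # the cross-product term has been removed
```

At the principal angle the rotated form has no cross-product term, which is the situation described by Eq. (17).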


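But inasmuch as $A = (1/\sigma_1')^2$ and $B = (1/\sigma_2')^2$, it follows that

$$AB = \frac{1}{(\sigma_1'\sigma_2')^2} = a'b' - h'^2$$

From this it follows that

$$k = \frac{1}{2\pi\sigma_1'\sigma_2'} = \frac{\sqrt{a'b' - h'^2}}{2\pi}$$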

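Use will now be made of these relationships to derive the values of $a'$, $b'$,
and $h'$.

As noted above, since

$$\iint k\,e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)}\,dx_1\,dx_2 = 1$$

then

$$\iint e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)}\,dx_1\,dx_2 = \frac{1}{k} = 2\pi\sigma_1'\sigma_2' = \frac{2\pi}{\sqrt{a'b' - h'^2}} \qquad (19)$$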


* See footnote *, p. 493.

¹ Cf. Fine and Thompson, op. cit., p. 131.

If both sides of Eq. (19) are differentiated with respect to $a'$, it is found
that

$$-\frac{1}{2}\iint x_1^2\,e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)}\,dx_1\,dx_2 = -\frac{1}{2}\,\frac{2\pi b'}{(a'b' - h'^2)^{3/2}} = -\frac{1}{2}\,\frac{b'}{k(a'b' - h'^2)}$$

By canceling out $-\frac{1}{2}$ and multiplying the equation by $k$, the left side is
equal to $\sigma_1^2$ [see Eq. (iv), page 492], and it is found that

$$\sigma_1^2 = \frac{b'}{a'b' - h'^2} \qquad \text{or} \qquad \text{(a)} \quad b' = \sigma_1^2(a'b' - h'^2)$$

If similar procedure is followed after differentiating Eq. (19) with respect
to $b'$, it is found that

$$\sigma_2^2 = \frac{a'}{a'b' - h'^2} \qquad \text{or} \qquad \text{(b)} \quad a' = \sigma_2^2(a'b' - h'^2)$$

If both sides of Eq. (19) are differentiated with respect to $h'$, it is found that

$$-\iint x_1x_2\,e^{-\frac{1}{2}(a'x_1^2 + 2h'x_1x_2 + b'x_2^2)}\,dx_1\,dx_2 = -\frac{1}{2}\,\frac{2\pi(-2h')}{(a'b' - h'^2)^{3/2}} = \frac{h'}{k(a'b' - h'^2)}$$

in which, if multiplied through by $k$, the left side equals $-\sigma_1\sigma_2r_{12}$
[see Eq. (vi)], and hence the whole expression reduces to

$$-\sigma_1\sigma_2r_{12} = \frac{h'}{a'b' - h'^2} \qquad \text{or} \qquad \text{(c)} \quad h' = -\sigma_1\sigma_2r_{12}(a'b' - h'^2)$$

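From Eqs. (a), (b), and (c), it follows that

$$b' = \frac{\sigma_1^2}{\sigma_2^2}\,a', \qquad h' = -r\,\frac{\sigma_1}{\sigma_2}\,a', \qquad \sqrt{a'b' - h'^2} = a'\,\frac{\sigma_1}{\sigma_2}\sqrt{1 - r^2} \qquad \text{(d)}$$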

Equations (a), (b), and (c) are three equations from which the values of
$a'$, $b'$, and $h'$ may be expressed in terms of $\sigma_1$, $\sigma_2$, and $r$. The
direct evaluation of $a'$, $b'$, and $h'$ from these equations is not a simple matter,
however, and it is easier to proceed as follows: From Eqs. (a), (b), and (c), it is
possible to express $b'$ and $h'$ in terms of $a'$, as noted above in Eq. (d). It will
also be recalled that

$$k = \frac{\sqrt{a'b' - h'^2}}{2\pi} = \frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}$$
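By substituting equivalent values, Eq. (16) may accordingly be written
as follows:

$$dP(x_1x_2) = \frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 \qquad (20)$$

The double sum $\iint dP(x_1x_2) = 1$, however, so that, from Eq. (20), it
follows that

$$\iint \frac{\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 = \frac{1}{a'} \qquad (21)$$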


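If both sides of Eq. (21) are differentiated with respect to $a'$, it is found
that

$$\iint \frac{\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\left[-\frac{\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)\right]dx_1\,dx_2 = -\frac{1}{(a')^2}$$

Multiplication of both sides by $-a'$ and expansion of terms then give the
following:

$$\frac{1}{2}\iint x_1^2\,\frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2$$

$$-\; r\,\frac{\sigma_1}{\sigma_2}\iint x_1x_2\,\frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2$$

$$+\; \frac{1}{2}\,\frac{\sigma_1^2}{\sigma_2^2}\iint x_2^2\,\frac{a'\sigma_1\sqrt{1 - r^2}}{2\pi\sigma_2}\,e^{-\frac{a'\sigma_1^2}{2}\left(\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 = \frac{1}{a'}$$

But the left side is equal to

$$\frac{1}{2}\iint x_1^2\,dP(x_1x_2) - r\,\frac{\sigma_1}{\sigma_2}\iint x_1x_2\,dP(x_1x_2) + \frac{1}{2}\,\frac{\sigma_1^2}{\sigma_2^2}\iint x_2^2\,dP(x_1x_2)$$

which, according to Eqs. (iv) to (vi), is equal to

$$\frac{1}{2}(\sigma_1^2) - r\,\frac{\sigma_1}{\sigma_2}(r\sigma_1\sigma_2) + \frac{1}{2}\,\frac{\sigma_1^2}{\sigma_2^2}(\sigma_2^2) = \sigma_1^2(1 - r^2)$$

Hence,

$$\sigma_1^2(1 - r^2) = \frac{1}{a'} \qquad \text{or} \qquad a' = \frac{1}{\sigma_1^2(1 - r^2)}$$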

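If this value of $a'$ is substituted in Eq. (20), it will give an equation in
which all the parameters have been evaluated in terms of the moments as
follows:

$$dP(x_1x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\,e^{-\frac{1}{2(1 - r^2)}\left(\frac{x_1^2}{\sigma_1^2} - 2r\frac{x_1x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\right)}\,dx_1\,dx_2 \qquad (22)$$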
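Equation (22) can be checked against the moment conditions (i), (iv), (v), and (vi) by direct summation over a fine grid. The sketch below is only an illustration under stated assumptions: the function names, the grid settings, and the values chosen for $\sigma_1$, $\sigma_2$, and $r$ are all hypothetical and do not come from the text.

```python
import math

def bivariate_normal_density(x1, x2, sigma1, sigma2, r):
    """Height of the frequency surface of Eq. (22) at deviations (x1, x2)."""
    one_minus_r2 = 1.0 - r * r
    k = 1.0 / (2.0 * math.pi * sigma1 * sigma2 * math.sqrt(one_minus_r2))
    q = ((x1 / sigma1) ** 2
         - 2.0 * r * x1 * x2 / (sigma1 * sigma2)
         + (x2 / sigma2) ** 2)
    return k * math.exp(-q / (2.0 * one_minus_r2))

def grid_moments(sigma1, sigma2, r, half_width=8.0, steps=300):
    """Midpoint-rule approximations to the double integrals (i), (iv), (v), (vi).

    The grid runs from -half_width to +half_width standard deviations on each axis.
    """
    d1 = 2.0 * half_width * sigma1 / steps
    d2 = 2.0 * half_width * sigma2 / steps
    total = m11 = m22 = m12 = 0.0
    for i in range(steps):
        x1 = -half_width * sigma1 + (i + 0.5) * d1
        for j in range(steps):
            x2 = -half_width * sigma2 + (j + 0.5) * d2
            p = bivariate_normal_density(x1, x2, sigma1, sigma2, r) * d1 * d2
            total += p
            m11 += p * x1 * x1
            m22 += p * x2 * x2
            m12 += p * x1 * x2
    return total, m11, m22, m12

sigma1, sigma2, r = 2.0, 3.0, 0.6        # illustrative values only
total, m11, m22, m12 = grid_moments(sigma1, sigma2, r)
print(round(total, 6), round(m11, 4), round(m22, 4), round(m12, 4))
```

With these illustrative values the four sums should come out very close to 1, $\sigma_1^2 = 4$, $\sigma_2^2 = 9$, and $r\sigma_1\sigma_2 = 3.6$, which is what Eqs. (i), (iv), (v), and (vi) require.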



PART V 


Study of Dynamic Variability 

CHAPTER XIX 
INDEX NUMBERS 

One of the most widely used statistical methods is the procedure
that gives rise to the summarizing or expression of data in
the form of index numbers. It is an application to a practical
problem of simple principles of ready comparison, principles of
averaging to obtain summary figures, and principles of stratified
sampling. Today, the method of index numbers is applied in
five large fields, as follows:

1. The measurement of the general price level, or the measurement
of general exchange value.

2. The measurement of groups of prices, such as wholesale
prices, retail prices, or wages.

3. The measurement of the general quantity of production or
trade with indexes of physical production, trade, or employment.

4. The measurement of the general volume of business or
trade with indexes of the value of production or trade, or with
so-called “barometers.”

5. Miscellaneous, including a wide variety of uses, some of
which are given below on pages 511-513.

History of Index Numbers. General use of the device known
as an “index number” to serve as a comprehensive method of
summarization is of recent origin. Like most of the modern
technique of statistics it has been developed since 1900. But
the fundamental idea is an old one. According to Warren and
Pearson, as early as 1738 Dutot made price comparisons showing
that a group of representative commodities cost twelve times as
much in 1735 as they did in 1608.¹ In 1764, an Italian, G. R.
Carli, attempted an investigation into the effect of the discovery
of America upon the purchasing power of money; he constructed
a very simple index number of prices, using only three
commodities, grain, wine, and oil. He combined the prices of these
three commodities in order to compare their average level in
1750 with the level of the same commodities in 1500.* The
gold movement from the New World to European countries
aroused speculation throughout the mercantile period with
respect to the relationship between prices and the amount of
money in circulation. Locke and Hume laid the groundwork
for the statement of what is now known as the quantity theory of
money. Speculations of the seventeenth and eighteenth centuries,
however, with the exception of Carli's unusual attempt,
were without the assistance of any measurement of the general
price level.

¹ Warren, George F., and Frank A. Pearson, Prices (1933), pp. 18-20,
containing other interesting examples of attempts to measure changes in
general price level prior to the middle of the nineteenth century.

Concern about the problem was brought to a new height during 
the Napoleonic Wars, when prices were fluctuating widely; and 
again during and following the Greenback era in the United 
States the question of the relationship between the general price 
level and the money supply became associated with inflationary 
issue of paper money. In the decade preceding the Civil War 
the discoveries of gold in California served to arouse interest in
the question of the effect of increased supplies of gold upon the 
general price level. 

Twentieth-century economists, already interested in the 
quantity theory of money by reason of the accumulation of these 
historical experiences, were provoked to continued and diligent 
study by the development of the South African gold mines since 
the 1890’s, accompanying world-wide general rising prices until 
the First World War. During the First World War and the 
subsequent period of maladjustment, with countries all over the 
world alternately on and off the gold standard, speculation in 
monetary theory became of such general interest that the prob¬ 
lem preoccupied some economists almost to the exclusion of other 
fields of study. 

* Mitchell, Wesley C., “Index Numbers of Wholesale Prices in the
United States and Foreign Countries,” Bureau of Labor Statistics, Bulletin
284, p. 7; cf. also reprint of Part I, “The Making and Using of Index
Numbers,” Bulletin 656 (1938), p. 7.

Meanwhile the statistical technique of measuring general price
change by the index-number device was developed; by 1798,
Sir George Schuckburg-Evelyn formulated a plan for making
index numbers of prices.¹ The efforts of the early statisticians in
this direction were accorded but scant approval by the economists,
who were apparently suspicious of “political arithmetic.”
Ricardo said that it is impossible to determine “the
value of a currency” by its “relation not to one, but to the mass
of commodities.”² Early in the nineteenth century mathematicians
were more interested in the application of the theory
of probabilities in the fields of astronomy, biology, anthropology,
and geology. The great exponents of the developing technique
in the application of statistical theory to the social sciences, such
as Quételet, were busy with problems in the realm of ethics and
morals; but about the middle of the nineteenth century came
powerful support for the application of these principles to
economic statistics.

William Stanley Jevons claimed that the works of Quételet
abundantly proved that many subjects in the social sciences are
so hopelessly intricate that they can be analyzed only by the
use of averages and by trusting to probabilities as the form of
generalization. He constructed indexes of wholesale prices in
order to measure the value of gold and invoked the theory of
probabilities as justification of his claim that the rise in prices
was connected with the change in the value of gold, saying that
“the odds are 10,000 to 1 against a series of disconnected and
casual circumstances having caused the rise of prices (one in the
case of one commodity, another in the case of another) instead
of some general cause acting over them all.” The general cause
acting over them all was considered to be the change in the
value of gold.³

¹ “An Account of Some Endeavors to Ascertain a Standard of Weight
and Measure,” Philosophical Transactions of the Royal Society of London,
Part I, Art. viii, pp. 132-185; citation from Wesley C. Mitchell, Business
Cycles—The Problem and Its Setting (1928), p. 191.

² Ibid., p. 193.

³ Ibid., p. 195.

In 1887 Prof. F. Y. Edgeworth began a series of contributions
to the problem of index numbers as a method of summarizing
trends in price statistics. He brought to bear upon the field of
the social sciences the mathematical theory of probability. He
saw clearly that it is a problem of applying a strictly a priori
theory to an analogous situation, but he insisted that the theory
of probability applied.¹ Later, the theoretical application of
probabilities to the problem of measuring social phenomena, and
particularly the general price level, was taken up by C. M.
Walsh, who published in 1901 a treatise on the measurement of
the price level and later published a book entitled The Problem
of Estimation, which further developed the application of
probability theory to economics.²

Since about 1915 the important problem of the technique of 
index-number construction has been attacked by a number of 
scholars. Prof. Wesley C. Mitchell was a pioneer in the explora¬ 
tion of the technical problems involved and a major part of their 
solution; others have done important work of this character 
during recent years, especially the economists and statisticians 
in government or semigovernment agencies, such as the Bureau 
of Labor Statistics and the Federal Reserve Board. 

Interpretation of the problems involved in the making of 
index numbers may be facilitated by analysis of two of the main 
principles involved: (1) the concepts of absolutes vs. relatives, 
and (2) the application of the theory of stratified sampling to the 
particular problem of the making of an index number. 

Conversion of Absolute Numbers to Relative Numbers. 
Absolutes. An absolute is an expression of the number of things 
being considered, measured by an appropriate unit, as 1,000 
bushels of wheat or 50 acres of land. A simple absolute taken 
by itself is of little importance. The number of people in a 
country is of no particular significance unless a comparison is 
desired, for example, a comparison with the natural resources 
of the country or with the population at some other point in time 
or in some other country. 

¹ Persons, W. M., “Statistics and Economic Theory,” Review of Economic
Statistics, Vol. 7 (1925), pp. 185-186. Also cited in Wesley C. Mitchell,
Business Cycles—The Problem and Its Setting (1928), p. 197.

² The Measurement of General Exchange-Value, pp. 553-574, cited in
Wesley C. Mitchell, “Index Numbers of Wholesale Prices in the United
States and Foreign Countries,” Bureau of Labor Statistics, p. 9.

Prices are ordinarily conceived of as absolutes; that is, saying
that the price of wheat today is one dollar a bushel refers to the
objective thing, namely, the concrete one dollar. It is true that
this particular absolute has a ratio aspect when it is thought of
as a measure of the value of wheat. But when thought of merely
as one of the goods in an exchange, the dollar can rationally be
considered to be an absolute. Prices accordingly are referred
to as “absolutes.”

Relatives. In tabular form a ready visualization is often
accomplished by converting absolutes to relatives of some selected
base. For example, Table 63 shows data on three important
types of productive activity in the United States.


Table 63.—Estimated Value of Selected Types of Private
Construction Activity in the United States

                     New factory          Farm            New nonfarm
                     construction      construction       residential
                                                          construction
                    Millions            Millions           Millions
Years              of dollars Index*   of dollars Index*  of dollars Index*

Annual average
  1926-1929            640     100        468      100      4,066     100
1932                    78      12        125       27        641      16
1933                   128      20        175       37        314       8
1937                   391      61        360       77      1,530      38
1938                   192      30        345       74      1,515      37
1939                   200      31        340       73      1,860      46
1940                   337      53        360       77      2,077      51

Source: Survey of Current Business, Vol. 21 (February, 1941), p. 21.
* Each index is on the base, average 1926-1929 = 100.


Considerable difficulty is encountered in obtaining a clear 
mental picture of the comparative changes in these three series 
by study of the absolutes themselves. Was the decline in new 
factory construction more severe in the 1932-1933 depression 
than the falling off in new residential construction? Did farm 
construction suffer more severely than new residential nonfarm 
construction? Such questions, involving comparative judg¬ 
ments, can be answered much more quickly if each series is 
converted into relatives or simple indexes upon a common base 
period. This is illustrated in Table 63, in the columns presenting 
the indexes with average 1926-1929 as the base. 
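The conversion of the three series to a common base can be sketched in a few lines. The figures below are copied from Table 63; the dictionary layout and the function name `to_relatives` are assumptions made only for this illustration.

```python
# Absolute values in millions of dollars, copied from Table 63.
BASE = "1926-1929 average"
construction = {
    "New factory": {BASE: 640, 1932: 78, 1933: 128, 1937: 391,
                    1938: 192, 1939: 200, 1940: 337},
    "Farm": {BASE: 468, 1932: 125, 1933: 175, 1937: 360,
             1938: 345, 1939: 340, 1940: 360},
    "New nonfarm residential": {BASE: 4066, 1932: 641, 1933: 314, 1937: 1530,
                                1938: 1515, 1939: 1860, 1940: 2077},
}

def to_relatives(series, base_key):
    """Express every figure in a series as a percentage of the base figure."""
    base = series[base_key]
    return {period: round(100 * value / base) for period, value in series.items()}

indexes = {name: to_relatives(series, BASE) for name, series in construction.items()}

# The depression years can now be compared directly across the three series.
for name, relatives in indexes.items():
    print(f"{name:<24} 1932: {relatives[1932]:>3}   1933: {relatives[1933]:>3}")
```

Read this way, the questions posed above answer themselves: at its low point new residential construction fell to a smaller fraction of its base than factory construction did, and farm construction held up best of the three.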

Simple index numbers, or relatives, of this sort involve the
notion of comparing with unity. The mind more readily
grasps expressions in round numbers than in odd numbers; it
further reduces mental effort if the round numbers are in
multiples of 10. From this fact arises the practice of relating prices
or other quantity figures or absolutes of any kind to each other
in such a way as to get a comparison based upon 1, 10, 100,
1,000, etc. If based upon 1, they are called “proportions”; if
based upon 100, they are called “percentages.” They are all
relatives, or indexes. Most commonly in the United States and
in Great Britain and many other countries, 100 is used, although
a few, notably Australia, use 1,000.

Even where there is but one price series, it is simpler to
comprehend the significance of change if the absolute prices are
converted to a relative form. For example, the changes in
price of coffee per pound, as shown in Table 64, are easier to
trace from period to period when expressed in relatives. Thus,
let 1926 be considered 100 and the prices in other years related
to it. The arithmetic involved is simple in principle and contains
two steps: (1) dividing the series throughout by the base selected,
which may more conveniently be done by multiplying throughout
by the reciprocal of the base, and (2) multiplying by 100. This
method, illustrated in Table 64, makes the figure for the base
period equal to 100, and the rest fluctuate as percentages of the
base.


Table 64.—Price of Coffee
Annual averages in New York market of No. 7 Rio coffee
(In dollars per pound)

Item          Symbol           1926     1933     1934     1941

Price, lb.    p                0.182    0.078    0.098    0.080
Relative      (100/0.182)p     100      43       54       44

Source: Bureau of Labor Statistics, Wholesale Prices (June and December issues of
specified years).
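The two steps just described may be sketched as follows. The coffee prices are those given in Tables 64 and 65; the function name `price_relatives` is an assumption introduced for the illustration.

```python
# Coffee prices in dollars per pound, as given in Tables 64 and 65.
coffee_prices = {1926: 0.182, 1933: 0.078, 1934: 0.098, 1941: 0.080}

def price_relatives(prices, base_year):
    """Convert absolute prices to relatives with the base year equal to 100."""
    reciprocal = 1.0 / prices[base_year]   # step 1: reciprocal of the base
    return {year: round(price * reciprocal * 100)   # step 2: multiply by 100
            for year, price in prices.items()}

for year, relative in price_relatives(coffee_prices, 1926).items():
    print(year, relative)
```

Multiplying by the reciprocal rather than dividing each figure separately matters little to a machine, but it was a real saving of labor when the work was done by hand or on a desk calculator.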

Another elementary idea is involved in the making of relatives,
and that is the reduction of nonhomogeneous sets of figures to a
homogeneous base for purposes of comparison and to simplify
interpretation of relative change among nonhomogeneous things.
For example, the prices of coffee per pound at different times,
the prices of canned peaches per dozen cans, and the prices of
wheat per bushel are all three presented in Table 65 for
comparison with each other.

It is difficult to compare the price of coffee per pound with the
price of wheat per bushel on the one hand and with the price of
canned peaches per dozen cans on the other, as they fluctuate
from time to time; but if all are changed to relative numbers,
by the method already described, with 1926 as a base period,
the comparison may easily be made. This is illustrated in
Table 66.


Table 65.—Prices of Coffee, Canned Peaches, and Wheat¹

Item               1926     1933     1934     1941

Coffee             0.182    0.078    0.098    0.080
Canned peaches     1.993    1.146    1.403    1.528
Wheat              1.496    ...      ...      ...

Source: Bureau of Labor Statistics, Wholesale Prices.

¹ Prices of canned peaches are annual averages, quoted in dollars per dozen cans; prices
of wheat are of No. 2 hard, Kansas City, quoted in dollars per bushel.


Table 66.—Price Relatives of Coffee, Canned Peaches, and Wheat
Average 1926 = 100

Item               1926     1933     1934     1941

Coffee             100      43       54       44
Canned peaches     100      58       70       77
Wheat              100      48       62       66


Relatives Using a Base Period in a Time Series. Price relatives,
and the relatives shown in Tables 64 and 66, illustrate relatives
using a base period in a time series. Three fundamental
precautions must be observed in the use of such relatives.

1. It is almost always advisable and sometimes it is necessary
to know the absolute figures as well as the relatives; otherwise
misinterpretation or even misrepresentation is likely to result.
A classic example of a use of relatives that produced
misinterpretation and may perhaps have even been intended to be
misrepresentation was the evidence presented in 1932 by some notable
protectionist “statesmen” in the United States Congress.
Following are some of the statistics they issued for public
consumption:

Table 67.—Some of the Large Increases in Imports during the First
8 Months of 1932
(In percentages)

                                                          Percentage
Commodity                                             Increase in Imports

Cod and other salt and pickled fish from Denmark           3,729.8
Salmon, fresh or frozen, from Japan                        2,511.8
Fish in airtight containers from Canada                    4,669.9
Cheese from Denmark                                          136.3
Wrapping paper (other than kraft) from Sweden                615.9
Pig iron from Sweden                                         181.0
Pig iron from the United Kingdom                             611.3
Wool and other yarns from the United Kingdom                 221.2
Long-staple cotton from Egypt and British India,
  but transshipped from the United Kingdom:
    Egypt                                                  1,283.1
    British India                                          1,128.1
From Canada, fresh pork                                      237.9
Dried peas from New Zealand                                  477.3

The purpose of these statistics was to prove that a veritable
flood of foreign goods was threatening to inundate this country,
put out of business all its domestic producers, and lower the
wages of domestic workers. But the statistics are not what they
seem to be. Some of the items were so very small in the
aggregate in January, 1932 (the date they began to increase according
to the table), that they were not even listed in the extensive
classified list of imports that is published monthly by the
Department of Commerce. If an exceedingly small item is increased by
1,000 per cent, it is still small. Each time it increases 1,000 per
cent, it is only eleven times as large as before; 2,000 per cent
means twenty-one times as large. In January, 1932, the amount
of pig iron imported into the United States from Sweden and the
United Kingdom combined was less than 460 tons, worth about
$4,500, which, compared with United States domestic
production, was a mere nothing. The imports in January, 1932, of
wrapping paper other than kraft from Sweden amounted to
$2,025. The imports in the same month of Egyptian cotton,
transshipped from the United Kingdom, amounted to $982.
The last item is particularly enlightening; it will be noted that
the specification “transshipped from the United Kingdom” is
carefully made. Most cotton of this type comes to the United
States directly from Egypt and is not essentially competitive
with American-grown cotton.
2. The meaning of a percentage figure is often ambiguous, and
study of its background is necessary before it can be properly
understood. An illustration of the misinterpretation of a
percentage figure can again be found in the arguments of American
protectionists. When it is alleged that our tariffs are already
too high, the protectionists like to reply that they are not too
high. To prove their statement they point to the fact that a
large percentage of the imports are on the “free list,” that is,
that a large proportion of imports into the United States are
charged no duty at all. This argument sounds plausible, but
its non sequitur quality becomes evident when it is realized that
the tariffs on dutiable goods are so high that they are virtually
excluded from entering; it is thus the virtual exclusion of certain
dutiable imports that causes a large proportion of imports to
be goods on the free list. If the entire 100 per cent of imports
were on the free list, it would mean, not that the tariff was not
high, but that the tariff was so high that none of the dutiable
goods could come in.

3. In a series of coordinate relatives, it is necessary to know 
the base and to specify it for the information of others. For 
example, death-rate figures are quite meaningless unless the 
comparison is known. The death rate may be expressed as so 
many deaths per 1,000 people or per 100 people, and the statisti¬ 
cian should indicate which comparison is made. Death rates 
for a given disease may be expressed as the number of deaths 
per 1,000 people afflicted by the disease rather than as the number
of deaths per 1,000 people whether exposed to it or not; again, the 
nature of the comparison should be specified. 
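The arithmetic behind the first precaution, that a large percentage increase on a small base is still a small amount, can be sketched as follows. The function names and the dollar figure are hypothetical and serve only as an illustration; they are not taken from the import statistics quoted above.

```python
def multiple_from_percent_increase(percent_increase):
    """How many times as large a figure is after rising by the given percentage."""
    return 1 + percent_increase / 100

def value_after_increase(original_value, percent_increase):
    """Absolute size of a figure after the given percentage increase."""
    return original_value * multiple_from_percent_increase(percent_increase)

for percent in (1000, 2000):
    print(percent, "per cent increase =",
          multiple_from_percent_increase(percent), "times as large")

# A hypothetical item worth only 1,000 dollars before the increase.
small_item_dollars = 1_000
print(value_after_increase(small_item_dollars, 1000))
```

An increase of 1,000 per cent thus leaves the hypothetical item at 11,000 dollars, a sum that remains trifling beside any large domestic industry, which is why the absolute figures must be known along with the relatives.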

In simple index numbers like those given as examples in 
Tables 64 and 66, it is essential to know that the base is 1926 
or the average of 1926-1929, as the case may be. This should 
always be indicated somewhere in the title or subcaptions of the 
table or in a footnote. 

Presumption of Normality in the Selected Base. When a series 
of coordinate relatives is constructed by relating a series of 
absolutes to some selected base, the base tends to be regarded 
as the normal level. Indexes in the series greater than 100 are 
looked upon as above normal, or above par, and indexes less 
than 100 are looked upon as below normal. Since this tendency 
exists, it is always desirable to give study to the matter of 
selecting the base. Is it in fact the one that is at normal level,
or is one of the other absolutes of the series at normal level? 

For example, in the illustration given in Table 63, should the 
annual average of new factory construction, 1926-1929, be 
regarded as normal? Was there, on the average, a normal 
amount of new nonfarm residential construction in those years? 
It might reasonably be argued that taking 3 years as a base is 
better than taking only 1 year, because an average of 3 years 
might tend to offset extreme fluctuations and produce compari¬ 
sons that would tend to be better than if only 1 year were used 
as a base. Thus the average of the 3 years might be about 
normal for each of the three types of construction compared, 
whereas if only 1 year were taken one or the other of the three 
types might have had an exceptionally high or low year. 

On the other hand, it may be pointed out that the years 
1926-1929 covered a range of years in which a great construction 
boom reached its peak. Consequently, all types of construction 
were above normal in all three of those years; some writers claim 
this was the peak of the greatest and longest construction boom 
in history. Construction was at a high level such as it might 
not be expected to reach again for many years, at least if the 
length of construction booms is some seventeen years from peak 
to peak, as some say it is. It may therefore be argued that 
1937 would be a better base to take, even if only 1 year is used. 
In that year the general level of economic activity seemed to 
be nearer to a normal or equilibrium than any other year in 
recent history, and certainly nearer normal than the boom year 
of 1929. But the year 1937 would be a poor base year for strike 
statistics because of the great disturbances in the coal industry 
in that year. 

Selection of the base has an important effect upon subsequent 
judgments as to the trends of the three series. If the average 
1926-1929 is taken as the base, all three of the construction 
activities were still below normal in the year 1940, as the indexes 
in Table 63 show; but if 1937 is taken as the base, the 1940 level 
of new factory construction would be 86, the 1940 level of farm 
construction would be 100, and the level of new nonfarm residen¬ 
tial construction would be 136. If 1937 is considered normal, 
in the years 1926-1929 new factory construction averaged 63 
per cent above normal, farm construction averaged 30 per cent 
above normal, and new nonfarm residential construction was 
166 per cent above normal. 

The data shown in Table 64 and the indexes presented in 
Tables 65 and 68 may also be used to illustrate the effect of the 
base selected. In Table 66 and Fig. 133, the year 1926 is the 
base and the prices of coffee, canned peaches, and wheat are 
each set at 100 in that year; subsequent years are indexed 
accordingly. From 1926 to 1933 the greatest decline occurred 
in the price of coffee, the next greatest occurred in the price of 
canned peaches, and the decline in the price of wheat was 
comparatively the least. Their relative recovery was in the same 
order, and all three were below normal in 1941. 

Fig. 133. Indexes of prices of wheat, canned peaches, and coffee. 1926 = 100. 

But if it is considered that they were at normal levels in 1941 
so that all are called 100 for that year, a quite different picture 
is obtained, as shown in Table 68 and Fig. 134. If these prices 
were normally related to each other in 1941, then in 1926 the 
price of coffee was more than twice normal, the price of wheat 
was well above 50 per cent over normal, and the price of canned 
peaches was 30 per cent over normal. Moreover, Fig. 134 and 
Table 68 seem to indicate that it was the price of canned peaches 
that was farthest below normal in 1933; the price of coffee was 
only slightly below normal. 

Table 68. Price Relatives of Coffee, Canned Peaches, and Wheat 
1941 = 100 

Item               1926    1933    1934    1941 
Coffee              228      98     122     100 
Canned peaches      130      75      92     100 
Wheat               151      73      94     100 
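The shift of base discussed here is simple arithmetic: each relative is divided by the relative of the new base year and multiplied by 100. The following is a minimal sketch in Python, assuming as inputs the rounded 1926 = 100 relatives for 1926, 1933, and 1941 that appear in Table 71; the function name `rebase` is illustrative only. Because the inputs are already rounded, the results can be expected to agree with Table 68 only to within about one index point.

```python
def rebase(relatives, new_base_year):
    """Re-express a series of relatives so that new_base_year equals 100."""
    base_value = relatives[new_base_year]
    return {year: 100.0 * value / base_value for year, value in relatives.items()}


# Rounded relatives on the base 1926 = 100 (taken from Table 71).
relatives_1926_base = {
    "Coffee":         {1926: 100, 1933: 43, 1941: 44},
    "Canned peaches": {1926: 100, 1933: 58, 1941: 77},
    "Wheat":          {1926: 100, 1933: 48, 1941: 66},
}

# The same series with 1941 taken as the base year.
relatives_1941_base = {
    item: rebase(series, 1941) for item, series in relatives_1926_base.items()
}

for item, series in relatives_1941_base.items():
    print(item, {year: round(value) for year, value in series.items()})
```

The relation between any two years of one series is unchanged by the shift; only the year that is called 100, and therefore the year that is implicitly treated as normal, is different.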



Fig. 134. Indexes of prices of coffee, canned peaches, and wheat. 1941 = 100. 

For most comparisons, a year too remote in the past is not a 
desirable base. For a long time, 1913, or an average of the 
years 1909-1914, was looked upon as the best base period to use, 
because it was the last normal period before the First World 
War. The farm bloc in Congress continued as late as 1941 to 
insist that farm prices should be permitted to rise to the par 
that existed before the First World War; but in 1941-1942, as 
farm prices began to rise at a more rapid rate than other prices 
so that they passed parity, the farm bloc began to insist upon a 
new definition of parity. The long survival of 1909-1914 as a 
base illustrates, not the general desirability of having a remote 
base period, but merely the power of the farm bloc. Ordinarily, 
general economic change over a 28-year period is sufficiently 
great to make such a base undesirable. 

In the 1920’s, accordingly, most comparisons came to be made, 
not with prewar 1913, but with the average of 1923-1925 or with 
the single year 1926; these years persisted as a base period much 
longer than might ordinarily be expected because the extreme 
decline of the depression of the early 1930’s made it difficult to 
select a new base period. Finally, however, as the years of the 
Second World War passed, the period immediately preceding it 
came to be regarded as the best base for current comparisons. 
In the early 1940’s the average for the years 1935-1939, or one 
of those years, began to be adopted as the base period.¹ 

¹ In the Survey of Current Business, Vol. 22 (November, 1942), the index 
of prices received by farmers was still reported on the base of the average of 
1909-1914 prices, the index of wholesale prices was still based on the 1926 
average, and the index of retail prices was based on the average for 
1923-1925; but the cost-of-living index was based on the average 1935-1939, 
and the indexes of the purchasing power of the dollar (wholesale, retail, 
and farm) were based upon the 1935-1939 average. The indexes of national 
income and industrial production were on the average 1935-1939 base. 
The indexes of some manufacturing data, such as orders, shipments, and 
inventories, were based upon the average for the single year 1939. The 
Survey of Current Business, Vol. 22 (December, 1942), published the Bureau 
of Labor Statistics indexes of wage-earner employment and weekly wages in 
manufacturing industries, revised, with the average of the year 1939 as 
the base. 

Relative Parts of a Whole. A single absolute quantity is often 
divided into several parts, and these several parts are expressed 
as percentages or proportions of the whole. These are properly 
called, not “index numbers,” but simply “relatives,” although 
they could be referred to as “constituent relatives.” The term 
index numbers, used with strict propriety, refers to a series of 
relatives that is a composite of a more or less large number of 
series of relative numbers. The series of relatives may be 
combined to form a series of index numbers by any one of a number 
of methods of aggregating or averaging, as will be explained 
later in this chapter. Accordingly, in strict usage when an 
index is an average of relatives, the term relative should be 
reserved for the separate ingredients and the term index should 
be used for the composite. Yet this distinction is often honored 
in the breach as well as in the observance, and the student must 
expect to find the term index used in place of relative. 

An important item to remember in the use of constituent 
relatives is that a relative increase or decrease does not neces¬ 
sarily mean an absolute increase or decrease in the subgroup. 
The absolute of the subgroup may, indeed, move in the opposite 
direction from that indicated by the relative figures. Con¬ 
stituent relatives are useful when it is required to see clearly 
the relative changes. If absolute changes are desired, the raw 
data must be examined. Table 69 is an example of the use of 
constituent relatives. It reveals the necessity of attention to the 
absolute as well as the relative figures. 

Table 69. Death Rates per 100,000 Policyholders from Selected Causes 
Weekly premium-paying industrial business, Metropolitan Life Insurance Company 

                               Annual rate per 100,000*     Percentage distribution 
                                                            of specified causes 
Specified causes of death       1940     1941     1942       1940     1941     1942 
All                            531.6    553.9    501.6     100.00   100.00   100.00 
Diabetes mellitus               31.1     33.8     30.2       5.85     6.10     6.02 
Appendicitis                     8.6      7.6      5.4       1.62     1.37     1.08 
Influenza and pneumonia         74.5     79.5     47.2      14.01    14.35     9.41 
Tuberculosis (all forms)        44.9     44.0     41.9       8.45     7.94     8.35 
Syphilis                        12.4     11.0     10.0       2.33     1.98     1.99 
Cancer (all forms)             102.1    103.8    102.2      19.21    18.74    20.37 
Diseases of the heart          233.3    245.9    236.7      43.89    44.39    47.19 
Motor-vehicle accidents         17.2     20.2     21.3       3.24     3.65     4.25 
Suicides                         7.5      8.1      6.7       1.41     1.46     1.34 

Source: Metropolitan Life Insurance Company, Statistical Bulletin, March, 1942, p. 11. 
* Per 100,000 policyholders, based upon first 3 months of each year. 


In order to illustrate the necessity of presenting the absolute 
figures as well as relative figures when constituent relatives are 
used, Table 70 is drawn up with a few items taken from Table 69. 
Study of the percentage distribution shown in Table 70 would 
appear to indicate that the death rate from suicides increased 
between the years 1940 and 1942. Actually, it decreased; 
merely its relative position became more important. The 
percentage distribution, in the absence of attention to the absolute 
figures, would also lead to a tendency to exaggerate the rise in 
the death rate from automobile accidents. These misleading 
results are due to the change in the size of the totals for the 
respective years considered: from 107.8 in 1940 to 115.4 in 
1941 and to 80.6 in 1942. 

Table 70. Death Rate per 100,000 Policyholders from Selected Causes 
Weekly premium-paying industrial business, Metropolitan Life Insurance Company 

                               Annual rate¹                 Percentage distribution 
                                                            of specified causes 
Specified causes of death       1940     1941     1942       1940     1941     1942 
All                            107.8    115.4     80.6     100.00   100.00   100.00 
Influenza and pneumonia         74.5     79.5     47.2      69.11    68.89    58.56 
Appendicitis                     8.6      7.6      5.4       7.97     6.59     6.70 
Suicides                         7.5      8.1      6.7       6.96     7.02     8.31 
Motor-vehicle accidents         17.2     20.2     21.3      15.96    17.50    26.43 

¹ Per 100,000 policyholders, based upon first 3 months of each year. 


Especially when a small number of rates are being considered, 
as in Table 70, it is necessary to study both the rates and the per¬ 
centage distribution. Actually, the study of rates is required to 
answer the question: Is the rate from suicides greater in 1942 
than in 1941? Study of the percentage distribution of specified 
causes is required to answer the questions: In 1942 were motor- 
vehicle accidents a more important cause of death than influenza 
and pneumonia combined? Did motor-vehicle accidents become 
relatively more important from 1940 to 1942 as compared with 
the other specified causes? Important questions are answered 
by each of the sets of figures; what is necessary to avoid is the 
use of the wrong set of figures to answer a given question. 
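
The contrast between the two sets of figures can be checked directly. The following is a minimal sketch in Python that recomputes the percentage distribution from the annual rates of Table 70 and sets the change in each rate beside the change in its share; the function and variable names are illustrative only.

```python
# Annual death rates per 100,000 policyholders, from Table 70.
rates = {
    "Influenza and pneumonia": {1940: 74.5, 1941: 79.5, 1942: 47.2},
    "Appendicitis":            {1940: 8.6,  1941: 7.6,  1942: 5.4},
    "Suicides":                {1940: 7.5,  1941: 8.1,  1942: 6.7},
    "Motor-vehicle accidents": {1940: 17.2, 1941: 20.2, 1942: 21.3},
}
years = (1940, 1941, 1942)


def percentage_distribution(rates, year):
    """Express each cause as a percentage of the total of the listed causes."""
    total = sum(by_year[year] for by_year in rates.values())
    return {cause: 100.0 * by_year[year] / total for cause, by_year in rates.items()}


shares = {year: percentage_distribution(rates, year) for year in years}

# Compare the absolute change in each rate with the change in its share.
for cause, by_year in rates.items():
    rate_change = by_year[1942] - by_year[1940]
    share_change = shares[1942][cause] - shares[1940][cause]
    print(f"{cause:26s} rate {rate_change:+6.1f}   share {share_change:+6.2f}")
```

For suicides the rate change is negative while the share change is positive, which is the situation the text warns against misreading.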

Great Variety of Simple Index Numbers in Use. Hundreds of 
simple index numbers are in use, and the number has been 
increasing rapidly since the First World War. Indexes of the 
simple type illustrated in Tables 63, 65, and 68 exist for nearly 
every separate industrial activity, for thousands of prices, for 
retail sales, wholesale sales, inventories, consumption of certain 
types of goods, and for many other things related to economic 
and social activity. Index numbers measuring production 
from month to month in a large list of industries have been com¬ 
piled and published by the Board of Governors of the Federal 
Reserve System and other agencies. Indexes of marketing of 
fish, dairy products, livestock, wool, and poultry and eggs have 
been compiled by the Bureau of Foreign and Domestic Commerce; 
and indexes of the marketing of cotton, fruits, grains and vege¬ 
tables, lumber, and other natural products are compiled by the 
same bureau. This bureau has also compiled and published a 
large number of simple relative figures for new orders and unfilled 
orders in a number of manufacturing industries, including iron 
and steel, paper, lumber, textiles; and another series of index 
numbers of commodity stocks of manufactured goods and of raw 
materials, such as chemicals, foodstuffs, metals, textile materials, 
and rubber products. These indexes are published currently in 
the Survey of Current Business by the United States Department 
of Commerce. In the League of Nations publications, indexes of 
world stocks of foodstuffs and certain raw materials are available. 

The United States Department of Commerce has recently 
begun the compilation and publication of indexes of transpor¬ 
tation for the United States. These monthly indexes include a 
combined index of all types of transportation, commodity and 
passenger, and also indexes by types of transportation, such as 
an index of air transportation and a combined index of intercity 
motorbus and truck transportation. The indexes are published 
monthly with the base period 1935-1939 = 100 and appear in 
the Survey of Current Business.¹ This publication contains 
other illustrations of the many uses of index numbers. 

¹ The transportation indexes are described in the Survey of Current 
Business, Vol. 22 (September, 1942), pp. 20-28; Vol. 23 (May, 1943), pp. 
20-27. 

The use of either subordinate or coordinate relatives to aid 
in the interpretation of series of data does not involve the 
application of the theory of statistics or the principles of sampling, 
although the gathering of the raw data may have involved the 
use of the latter. The rules of comparability must be considered 
when numbers are converted into relatives, however, as 
indicated in the discussion above. When a whole series of these 
simple index numbers, or relatives, are combined into a 
composite index number, it is necessary to make application of the 
theory of statistics. The principles of stratified sampling apply 
to the construction of these composite index numbers. 

Index Numbers. Application of Sampling Technique. Index 
numbers are combinations of a large number of single series of 
relatives by some method of aggregating or averaging. In the 
field of prices, indexes of farm prices, of cost of living, of retail 
prices, of wholesale prices, of wages, and of exchange rates are 
some of the index numbers obtainable. Also, indexes of 
industrial production, of trade activity, of retail trade, and of 
employment are found in various sources. All these indexes are 
combinations of numerous series of relatives. 

From consideration of the various purposes for which index 
numbers may be used, it should immediately be apparent that a 
difficulty is involved. How, for example, is it possible to get 
together all the facts in the United States regarding all whole¬ 
sale prices from time to time, or all wages, or all retail prices, or all 
production or consumption activities? The answer, of course, 
is that it is not possible, or certainly not feasible, but that a 
sample of some kind must be used. When a composite is made 
up of several series, how shall they be weighted? Should they 
be considered of equal importance, and if not how shall their 
relative importance be determined in making up the composite? 
It is upon the basis of the principles of sampling that such 
index numbers are justified. As Prof. Edgeworth once said, 
the task is to extricate from fallible observations a mean apt 
to represent the general trend of prices, wages, production, or 
whatever is being measured.¹ 

The demonstration by eighteenth- and nineteenth-century 
statisticians, such as Süssmilch and Quetelet, that a hitherto 
unsuspected regularity lay hidden in numerical data about 
social phenomena encouraged economists and social scientists 
in the belief that known variations that had been measured might 
be fair samples of the more numerous unknown variations. 
Furthermore, the construction of a great variety of composite 
index numbers by different investigators using different methods 
has produced results of such consistency as to inspire confidence 
in their use.² 

¹ Cf. Mitchell, Wesley C., Business Cycles: The Problem and Its Setting 
(1928), p. 204. 

² Cf. Mitchell, Wesley C., “Index Numbers of Wholesale Prices in the 
United States and Foreign Countries,” Bureau of Labor Statistics, Bulletin 
284, p. 11. 

To grasp the significance of an index number, it is not sufficient 
to have reference only to the summary picture it presents. Just 
as in the case of an average of one frequency distribution, so in 
the case of index numbers, the distribution of cases is of great 
importance. An index number is really a series of averages 
based upon a series of frequency distributions (one frequency 
distribution for each time period), of which the index number 
itself is an average of some sort. A study based upon this idea 
was made by Wesley C. Mitchell in his analysis of year-to-year 
fluctuations of the prices recorded in the wholesale price 
bulletins of the Bureau of Labor Statistics, covering prices from 
1891 to 1918 and including 232 to 348 commodities. He found 
that the price changes from year to year formed a fairly 
symmetrical frequency distribution each year, and hence he 
concluded that “when it can be shown that phenomena are 
distributed approximately in this fashion, their average can safely 
be accepted as a significant measure of the whole set of variations, 
since even the deviations from the average are then grouped in a 
tolerably definite and symmetrical fashion about the average.”¹ 

Such an analysis seemed to establish as satisfactory the use 
of an average to summarize price change from year to year; but 
index numbers frequently extend over a considerable period of 
time so that the general level of wholesale prices of 1942, for 
example, is compared with 1926 as a base or with 1935-1939. 
Year-to-year fluctuations may occur in a manner such that the 
average may be used to summarize; but what of change 
compared with some year more remote in the past? In order to 
test the reliability of the method of index-number construction in 
this regard, Prof. Wesley C. Mitchell applied the technique of 
taking several samples; in one sample he took 242 commodities, 
in another 50 commodities, and in a third sample 25 commodities 
at wholesale prices and constructed three sample index numbers 
for the period 1890-1913. He found that the results from the 
smaller samples were strikingly close to those of the larger 
sample.² 

¹ Ibid., pp. 17-18. 

² Ibid., p. 38. The theory of sampling errors does not apply in a way 
that makes possible mathematical tests from a single sample. 

Stratified Sampling Method Applied. Others have also found 
that whenever the principles of stratified sampling have been 
followed in the construction of index numbers of wholesale 
prices, the results obtained are similar to the results obtained by 
the use of all available data. This inspired confidence in index 
numbers extending back through the years, for which fewer price 
series are available in published records, and at the same time 
increased the belief that such an average expression of prices in 
the form of index numbers is a valid summary picture of general 
price change. 

It is upon the basis of the principle of stratified sampling that 
it is possible to measure by index numbers such things as the 
cost of living, or the volume of production, or the general whole¬ 
sale price level. Also, it is upon the basis of the theory of 
sampling that credence can be given to index numbers; in 
addition, it is due to this very fact that it is necessary to examine 
the constituent parts of an index number to be sure that it 
measures what it purports to measure and that it is applicable to 
any particular problem for which it is desired to use an index 
number. 

It is necessary to notice that stratified sampling is applied to 
the making of index numbers. For example, take the problem 
of measuring general price movement. This is not a case in 
which there is an infinite number of items, although the universe 
is a very large number, and the number for which data are given 
is probably less than the number for which data are unavailable, 
particularly in the case of retail prices or wages. In the case of 
wholesale prices the available data cover a larger portion of the 
universe. 

Not only is there not an infinite number of items, but the 
number of available items is often not a very large one. For 
example, some index numbers are based upon less than 50 
individual index-number series. However, the universe from 
which the items are taken is one concerning which a priori 
knowledge exists. According to such a priori knowledge, a 
representative sample can be obtained by a conscious or delib¬ 
erate proportional selection of items from the various known 
strata of the universe. For example, it is known, in the case of 
wholesale prices, that the universe is made up of prices of foods, 
prices of metals, prices of forest products, prices of various raw 
materials, prices of semimanufactured products in a number of 
fields that can be classified, and prices of final goods at 
wholesale, to enumerate a few of the known strata of this universe. 
Knowing that such strata exist in the universe, the sample can 
be made proportional by a deliberate stratified random sampling 
procedure that would ensure proper representation in the sample 
of all the various strata known to exist in the universe.¹ 
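
The proportional selection described above can be sketched as follows. This is only an illustration in Python under stated assumptions: the strata names echo those in the text, but the number of price series in each stratum is invented, and the function name and its rounding rule are hypothetical rather than taken from any published index procedure.

```python
import random


def proportional_stratified_sample(strata, sample_size, seed=0):
    """Draw a random sample whose strata shares mirror those of the universe.

    strata maps a stratum name to the list of items in that stratum.
    """
    rng = random.Random(seed)
    universe_size = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        # Allocate in proportion to the stratum's share, with at least one item.
        quota = max(1, round(sample_size * len(items) / universe_size))
        sample[name] = rng.sample(items, min(quota, len(items)))
    return sample


# Hypothetical universe of wholesale price series; the counts are invented.
strata = {
    "Foods": [f"food_{i}" for i in range(120)],
    "Metals": [f"metal_{i}" for i in range(60)],
    "Forest products": [f"forest_{i}" for i in range(20)],
}

sample = proportional_stratified_sample(strata, sample_size=50)
print({name: len(items) for name, items in sample.items()})
```

With other stratum sizes the rounded quotas need not add up exactly to the requested sample size; a practical procedure would have to settle how the remainder is assigned.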

Variety of Purposes of Index Numbers. As a historical 
proposition, the original all-pervading purpose of an index number 
was to measure general exchange value, that is to say, to explain 
the relationship between prices, in their general or average move¬ 
ment, and the value of money and credit. 

At the present time, however, a large number of general 
indexes of prices and other phenomena are currently published; 
but few even of the general price indexes purport to be a measure 
of the value of money. General indexes of retail prices, indexes 
of wages and pay-roll totals, indexes of prices of farm products, 
metal products, manufactured goods, and raw materials, as 
well as general wholesale prices, are now available. Which of 
these price indexes really measures the value of money? 

¹ Cf. King, W. I., Index Numbers Elucidated, especially pp. 64-66. This, 
of course, often turns out to be a counsel of perfection in practice. The 
principle is based upon the assumption that in each of the strata designated 
the available data can be sampled successfully at random; and in practice 
this is often not true. For illustration, in gathering prices for an index of 
wholesale prices such subgroups of prices, or strata, as sulphuric acid and 
Portland cement are standardized, while house furnishings are not. From 
the point of view of obtaining the best possible results with the minimum 
amount of price gathering, and presumably with limited funds for the 
purpose, it would be sound practice to abandon the counsel of perfection 
and spend less money gathering prices of standardized articles and more 
gathering prices of nonstandardized articles. The resulting disproportionate 
amount of prices in the respective subgroups can then be countered by the 
required adjustments in the weights used to combine the series of relatives 
into index numbers. 

Some statisticians and economists have held that a real 
measure of the changes in the value of money and credit should 
include, not only wholesale prices, but also wages, rent, and 
other prices, including retail prices and perhaps the prices of 
securities. Samples of each kind of price should be included in 
the index of prices that aims to measure general exchange value. 
On this theory, Carl Snyder, at that time statistician for the 
Federal Reserve Bank of New York, compiled an “index of 
general price level,” including wholesale and retail prices, wages, 
rents, etc., but for certain reasons he excluded security prices. 
After careful study, these various components were given certain 
weights in the general composite. It should be pointed out that 
this index of general price level was originated for the special 
purpose of deflating data on bank clearings. Since bank clear¬ 
ings included payments for all these things, Snyder believed that 
an index of prices based upon these components could be used to 
cancel out that part of change in total bank clearings due to 
price change and obtain thereby an index of physical volume of 
trade. Even if it is granted that this index of general price level 
is valid as a deflator of bank clearings, it still remains a question 
whether or not it measures the exchange value of money. 

It could be argued with considerable force that such a general 
measure is impracticable because of the difficulty of getting 
adequate samples of rents, for example. And in any case, such 
a general measure of prices does not really give the measure of 
change in the purchasing power of money. The general pur¬ 
chasing power of money may be a far more flexible and possibly 
sensitive factor than this general price average would indicate. 
A general price average would include an overweight of prices 
largely controlled by custom, or of prices in which resistance to 
change is very great for some other reason, as, for example, 
because of public regulation, taxation, or their indirect effects. 
The true measure of change in general exchange value may 
be more nearly approximated by the wholesale price index and 
perhaps even by the group of more sensitive wholesale prices. 

It is not the purpose here to carry this argument to a conclusion 
but merely to suggest its unsettled state. It may be significant 
that the Bureau of Labor Statistics has published reciprocals of 
its several indexes of prices (wholesale, retail, cost-of-living, and 
farm products) as indexes of the purchasing power of the dollar 
in those respective fields. The question of how to measure 
general exchange value, or the purchasing power of money, 
continues to be a controversial one. Meanwhile, index numbers 
continue to serve enormously useful special purposes whether 
or not collectively or individually they measure general exchange 
value. In his Treatise on Money, J. M. Keynes appears to 
suggest that the exchange should be looked upon as a number of 
relatively noncompeting groups of markets and that there may 
be no such thing as a general purchasing power of money.¹ 

In the light of such theoretical difficulties hindering the 
proper measurement of the purchasing power of money by using 
reciprocals of price indexes, recent attempts have been made to 
construct indexes of the purchasing power of money by other 
means. One notable contribution is the index of purchasing 
power constructed by Murray Shields; this combines monetary 
data, viz., demand deposits, foreign deposits in the United 
States, foreign bank deposits in Federal reserve banks, volume of 
money in circulation, and cash in the vaults of commercial 
banks.² 

¹ Cf. also Beckhart, B. H., The New York Money Market, Vol. 2, and 
King, op. cit., pp. 189-216. 

² “A Measure of Purchasing Power Inflation and Deflation,” Journal of 
the American Statistical Association, Vol. 35 (1940), pp. 461-470. 

Construction of Index Numbers. Principal Methods. Prof. 
Irving Fisher of Yale University, in a comprehensive study 
of the mathematics of index-number making, found several 
hundred kinds of formulas for calculating index numbers; but 
it is quite unnecessary to be disturbed by this fact, since, as 
he himself says, only a few of them are of any value. There are 
two principal methods of calculating index numbers now in use 
and generally recognized as adequate for most purposes, but 
other methods are occasionally used and will therefore be 
described. The most commonly used are (1) the weighted 
average-of-relatives method and (2) the weighted aggregative 
method. Other methods sometimes used are (3) the simple 
average-of-relatives method and (4) the simple aggregative 
method. Various alternative ways of applying these methods 
are possible. For example, in the case of the simple average of 
relatives, sometimes the median is used instead of the 
arithmetical mean in order to avoid extreme variations; it is advisable 
to use the median especially for very small samples. These 
methods will be taken up in the order (3), (1), (4), and (2), 
which is the logical method of treating them, rather than in the 
order of their prevalence in use, which is that given above. 

Simple Average-of-relatives Method. Referring again to the 
simple case of the prices of coffee, wheat, and canned peaches, 
already used, perhaps the first method that would suggest itself 
to anyone desiring to obtain a summary figure representing 
average price change would be to add up the relatives and divide 
by their number, as follows: 


Table 71. Composite Index Number of the Prices of Coffee, Wheat, 
and Canned Peaches 
1926 = 100 

Commodity          1926                     1933                    1941 
Coffee             $p_0/p_0$ = 100          $p_1/p_0$ = 43          $p_2/p_0$ = 44 
Canned peaches     $p_0'/p_0'$ = 100        $p_1'/p_0'$ = 58        $p_2'/p_0'$ = 77 
Wheat              $p_0''/p_0''$ = 100      $p_1''/p_0''$ = 48      $p_2''/p_0''$ = 66 
Average            300 ÷ 3 = 100            149 ÷ 3 = 50            187 ÷ 3 = 62 

The resulting composite index number shows that on the 
average these three prices fell to 50 in 1933 in comparison with 
100 in 1926 and then rose on the average to 62 in 1941 in 
comparison with 100 in 1926. Reducing this method to symbols, 

Let $p_0, p_1, p_2$ represent the prices of coffee. 
    $p_0', p_1', p_2'$ represent the prices of canned peaches. 
    $p_0'', p_1'', p_2''$ represent the prices of wheat. 

The relatives that appear in Table 71 are thus shown also in 
symbols. For example, the ratio $p_2''/p_0''$ corresponds to 66; in 
these symbolical presentations the multiple 100 is always 
“understood” and not actually written in the formula. The averages 
for the three would be expressed by symbols in Table 72: 

Table 72 

Year    Average of relatives 
1926    $\left(\dfrac{p_0}{p_0} + \dfrac{p_0'}{p_0'} + \dfrac{p_0''}{p_0''}\right) \div 3$ 
1933    $\left(\dfrac{p_1}{p_0} + \dfrac{p_1'}{p_0'} + \dfrac{p_1''}{p_0''}\right) \div 3$ 
1941    $\left(\dfrac{p_2}{p_0} + \dfrac{p_2'}{p_0'} + \dfrac{p_2''}{p_0''}\right) \div 3$ 

These averages are represented by the letter P, and when N 
commodity prices are averaged, instead of only three, for n years, 
instead of only 3, the series of averages of relatives is 
represented symbolically as follows: 

$$P_{00} = \frac{\sum \frac{p_0}{p_0}}{N}, \quad P_{01} = \frac{\sum \frac{p_1}{p_0}}{N}, \quad P_{02} = \frac{\sum \frac{p_2}{p_0}}{N}, \quad \ldots, \quad P_{0n} = \frac{\sum \frac{p_n}{p_0}}{N} \qquad (1)$$

The capital N refers to the number of prices, and the small 
subscript n refers to the number of years, or number of time periods, 
which might be months or weeks as well as years. In general, 
the subscripts to the series of p’s represent the time periods, and 
0 is assigned to the base time period, at which the relative equals 
100. The average of the relatives likewise equals 100 in the 
base time period. The primes refer to different commodities. 
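
The calculation of Table 71 can be sketched in a few lines of Python. The sketch assumes the rounded relatives of Table 71 as inputs; the function name is illustrative only, and the optional use of the median follows the alternative mentioned earlier for very small samples.

```python
from statistics import mean, median

# Price relatives on the base 1926 = 100, from Table 71.
relatives = {
    "Coffee":         {1926: 100, 1933: 43, 1941: 44},
    "Canned peaches": {1926: 100, 1933: 58, 1941: 77},
    "Wheat":          {1926: 100, 1933: 48, 1941: 66},
}


def simple_average_of_relatives(relatives, year, average=mean):
    """Average the relatives of all commodities for one time period."""
    return average([series[year] for series in relatives.values()])


for year in (1926, 1933, 1941):
    by_mean = simple_average_of_relatives(relatives, year)
    by_median = simple_average_of_relatives(relatives, year, average=median)
    print(year, round(by_mean, 1), by_median)
```

With only three commodities the median is simply the middle relative, so it is unaffected by an extreme movement in either of the other two.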

Weighted Average-of-relatives Method. The simple average of 
relatives involves the assumption that changes in the several 
prices to be combined are of equal importance; but this may not 
be true. Consequently, the idea of weighting the component 
price relatives in accordance with weights that are considered to 
reflect their relative importance has been developed. 

The weights are commonly based upon some rational con¬ 
sideration such as the quantities consumed in a given represen¬ 
tative year, the quantities produced, family budget figures, or 
some other criterion. Suppose, after considering all available 
information on the subject, changes in the price of a pound of 
coffee are considered thrice as important as changes in the price 
of a dozen cans of peaches and changes in the price of wheat per 
bushel are judged twice as important as changes in the price of a 
pound of coffee. Convenience of calculation will be attained 
if the numbers used as weights are so arranged that they will 
sum up to 1, 10, or 100, because the averaging process will then 
be a simple matter of changing decimal points in the sum of the 
weighted relatives. Such a manipulation of the quantities 
representing weights will have no effect on the final answer and 
will reduce the amount of work considerably if the problem is a 
long one involving, say, several years of monthly indexes. In 
the illustration used above, the weights are as follows: 

Coffee, 3 = $w$ 
Canned peaches, 1 = $w'$ 
Wheat, 6 = $w''$ 

A weighted average-of-relatives index number of these three 
commodities would be calculated as illustrated in Table 73. 


Table 73. Index Number of the Prices of Coffee, Wheat, and 
Canned Peaches 
Weighted average of relatives, 1926 = 100 

Commodity            1926                 1933                 1941 
Coffee               100 × 3 =  300       43 × 3 = 129         44 × 3 = 132 
Canned peaches       100 × 1 =  100       58 × 1 =  58         77 × 1 =  77 
Wheat                100 × 6 =  600       48 × 6 = 288         66 × 6 = 396 
Sum                            1000                475                  605 
Weighted average     1000 ÷ 10 = 100      475 ÷ 10 = 47.5      605 ÷ 10 = 60.5 

In symbolic language, the weighted average of relatives illus¬ 
trated in Table 73 is as follows: 


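\[ P_{00} = \frac{\sum w \dfrac{p_0}{p_0}}{\sum w}, \qquad P_{01} = \frac{\sum w \dfrac{p_1}{p_0}}{\sum w}, \qquad P_{02} = \frac{\sum w \dfrac{p_2}{p_0}}{\sum w} \qquad (2) \]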
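The arithmetic of Table 73 can be sketched in a few lines of Python. This is only an illustration: the variable and function names are assumptions introduced here, and the relatives are entered as percentages, as in the table. The last step checks the remark made above that rescaling the weights (here so that they sum to 1 instead of 10) leaves the answer unchanged.

```python
# Price relatives (1926 = 100) and weights w, w', w'' taken from Table 73.
relatives = {
    "coffee":         {1926: 100, 1933: 43, 1941: 44},
    "canned peaches": {1926: 100, 1933: 58, 1941: 77},
    "wheat":          {1926: 100, 1933: 48, 1941: 66},
}
weights = {"coffee": 3, "canned peaches": 1, "wheat": 6}


def weighted_average_of_relatives(relatives, weights, year):
    """Return sum(w * relative) / sum(w) for one year."""
    weighted_sum = sum(weights[c] * relatives[c][year] for c in relatives)
    return weighted_sum / sum(weights.values())


years = (1926, 1933, 1941)
index = {y: weighted_average_of_relatives(relatives, weights, y) for y in years}

# Rescale the weights so that they sum to 1; the index is unaffected.
scaled = {c: w / 10 for c, w in weights.items()}
index_scaled = {y: weighted_average_of_relatives(relatives, scaled, y) for y in years}

for y in years:
    print(y, round(index[y], 1), round(index_scaled[y], 1))
```

The three values printed for each year agree with the last row of Table 73 (100, 47.5, and 60.5).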


Instead of weighting by arbitrary weights, the actual quan¬ 
tities of the articles consumed or produced in the base year are 
sometimes used as weights, if such data are available. The 
quantities of the base year or base period are retained through¬ 
out, instead of getting the new quantities each year or each time 
period, for two reasons: (1) because it is difficult if not impossible 
to get quantity figures for every year and (2) because the pro¬ 
portions between these quantities are not likely to change 
greatly over short periods of time. If, after a given base period 
has been used for some time, it is discovered that one or several 
of the quantity weights are at variance with current conditions 
that seem to be likely to persist, the system of weights may be 
revised. In the various index numbers it constructs, the Bureau 
of Labor Statistics keeps continually on the watch for such 
changing conditions and when desirable changes the weighting 
system. 

The symbols for quantity weights are series of q’s, as follows: 

q_0 = quantity of coffee in 1926 
q_1 = quantity of coffee in 1933 
q_2 = quantity of coffee in 1941 
q'_0 = quantity of canned peaches in 1926 
q'_1 = quantity of canned peaches in 1933 
q'_2 = quantity of canned peaches in 1941 

Wheat would be the same arrangement of a series of q''’s. The 
resulting index number, using base-year quantities as weights for 
the relatives averaged, would be as follows: 

\[ P_{00} = \frac{\sum q_0 \dfrac{p_0}{p_0}}{\sum q_0}, \qquad P_{01} = \frac{\sum q_0 \dfrac{p_1}{p_0}}{\sum q_0}, \qquad P_{02} = \frac{\sum q_0 \dfrac{p_2}{p_0}}{\sum q_0} \qquad (3) \]



Simple Aggregative Method. As suggested by its name, a 
simple aggregative index is the sum of the absolute prices, 
without first changing them to relatives. Thus the raw 
prices of coffee, canned peaches, and wheat for 1926, then for 
1933, and then for 1941 would be added together to give 
the index. This seems to be combining nonhomogeneous 
things, and it is; nevertheless, there is one famous and at one 
time widely used index that was based upon this method. 
Such was Bradstreet’s index of wholesale prices, which continued 
in use for many years. Following is an illustration of the method 
of Bradstreet’s index of wholesale prices:¹ 

Prices, Dollars per Pound 
0.0007   Connellsville coke, southern coke 
0.001    Bituminous coal, brick, iron ore 
0.002    Anthracite coal 
0.003    Salt 
0.004    Bessemer pig iron 
. . . 
0.34     Alcohol 
0.50     Australian wool 
0.52     Quicksilver 
0.84     Rubber 
9.8530   The sum, which is the index 

According to this method, the index number of prices does 
not assume the form of a relative, but appears as follows: 

¹ A good description of Bradstreet’s index is contained in W. C. Mitchell, 
“Index Numbers of Wholesale Prices in the United States and Foreign 
Countries,” Bureau of Labor Statistics, Bulletin 284, pp. 161-165. Other 
price indexes are also discussed in that source, such as Dun’s, Gibson’s, 
the Annalist, War Industries Board, Federal Reserve Board, and the 
Bureau of Labor Statistics. 
Table 74. Bradstreet’s Index 

1933         Index       1934         Index 
October      $9.0512     January      $8.8329 
November      8.8480     February      9.0110 
December      8.8126     March         9.2627 

The index can readily be converted into a series of relatives 
upon any chosen base; the Survey of Current Business published 
Bradstreet’s index converted into relatives, with the monthly 
average of 1926 = 100 until November, 1937, when compilation 
of the index was discontinued.¹ 

Little rational justification can be mustered to the defense of 
such an index as Bradstreet’s, except that it worked well. Using 
approximately 96 commodities, it gave an index number that 
reflected accurately the changes in wholesale prices, as tested 
by more elaborately conceived and compiled indexes of 
wholesale prices later introduced into the field. Bradstreet’s 
index was the pioneer in the history of price indexes in the 
United States, having been started in 1897. The conversion of 
all prices into prices per pound gives the effect of a concealed 
weighting, but no logical basis can be found for such a system of 
weighting. The symbolic expression of this index is as follows: 

\[ \sum p_0, \quad \sum p_1, \quad \sum p_2, \quad \ldots, \quad \sum p_n \qquad (4) \]

When reduced to relatives and some base is taken as 100, it is 
as follows: 


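\[ P_{00} = \frac{\sum p_0}{\sum p_0}, \qquad P_{01} = \frac{\sum p_1}{\sum p_0}, \qquad \ldots, \qquad P_{0n} = \frac{\sum p_n}{\sum p_0} \qquad (5) \]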
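As a small illustration of this conversion, the six monthly aggregates of Table 74 can be turned into relatives in Python. The choice of October, 1933, as the base is an assumption made only for this sketch (the Survey of Current Business used the 1926 monthly average, which is not given here), and the names are illustrative.

```python
# Bradstreet's index (sum of prices per pound, in dollars) from Table 74.
bradstreet = {
    "1933-10": 9.0512, "1933-11": 8.8480, "1933-12": 8.8126,
    "1934-01": 8.8329, "1934-02": 9.0110, "1934-03": 9.2627,
}


def to_relatives(aggregates, base_period):
    """Express each aggregate as a percentage of the base-period aggregate."""
    base = aggregates[base_period]
    return {period: 100 * value / base for period, value in aggregates.items()}


# Hypothetical base chosen for the sketch: October, 1933 = 100.
relatives = to_relatives(bradstreet, "1933-10")
for period, value in relatives.items():
    print(period, round(value, 2))
```

Any other period, or an average of several periods, could serve as the divider in the same way.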


¹ Cf. Survey of Current Business, Supplement, Vol. 18 (1938), p. 168. 
Monthly figures for the index are available from 1903 and annual figures 
from 1890. See 1932 Supplement, pp. 28-29 and 1936 Supplement, p. 15. 
Also see Bureau of Labor Statistics, Bulletin 173 (July, 1915). 

While the concealed weighting system of Bradstreet’s index is 
accidental, or haphazard, depending upon the units in which 
goods are quoted, it has the effect of making the high-priced 
articles dominant. Its success as a good index of price change 
was due to the fact that there was a skillful or at least a 
propitious use of stratified sampling in the selection of the prices 
used. 
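The dominance of the high-priced articles can be seen from two of the prices quoted above. The following Python sketch, with illustrative names, asks how much the aggregate of 9.8530 would move if a single price rose by 10 per cent while all others stood still; the 10 per cent figure is an assumption chosen for the example.

```python
# Two prices per pound quoted in the illustration; the full index summed to 9.8530.
index_total = 9.8530
prices = {"Connellsville coke": 0.0007, "Rubber": 0.84}


def index_change_pct(price, rise=0.10, total=index_total):
    """Percentage change in the aggregate when one price rises by `rise`."""
    return 100 * (price * rise) / total


effect = {name: index_change_pct(p) for name, p in prices.items()}
ratio = effect["Rubber"] / effect["Connellsville coke"]

for name, pct in effect.items():
    print(f"{name}: index rises by {pct:.5f} per cent")
print(f"Rubber moves the index about {ratio:.0f} times as much as coke")
```

The same proportional price change thus has an effect on the index in proportion to the quoted price per pound, which is the concealed weighting described in the text.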

Weighted Aggregative Method. In making index numbers by 
the aggregative method, it is usually considered that weights 
are required, for the same reason that they are regarded as 
necessary in constructing an index number by the average-of- 
relatives method. The most reasonable kind of weight would 
seem to be the quantities of the several commodities produced or 
consumed or marketed. Such figures have become increasingly 
available since the time when such indexes as Bradstreet’s were 
originally conceived and developed. 

The last four or five decennial censuses of the United States 
have included more and more complete data on physical quan¬ 
tities of production and, more recently, data on retail and 
wholesale trade; and, in the years since the First World War, 
yearly figures have been available on physical quantities of goods 
in stock and physical production of some goods, through the 
activities of the United States Departments of Commerce and 
Agriculture. If it is assumed that the method of weighting is 
one that uses actual quantity figures, there are two methods of 
weighting the price aggregates in order to construct the index 
number. The first method is called “weighting by base-year 
quantities.” The second method is called “weighting by 
given-year quantities.” 

The desirability of weighting by base-year quantities has a 
twofold explanation: (1) In spite of the increased availability 
of quantity figures, there are still many commodities for which 
quantity figures are not easily available for every year; but a 
large number of such quantity figures, classified so as to be 
useful for weighting purposes, are available for the census years. 
(2) With few exceptions, the proportional changes in the quan¬ 
tities or value weights from year to year are not sufficiently great 
to cause large errors if these proportions are assumed to remain 
constant for several years in succession. Adjustments in the 
quantity or value weights can be made in the case of rapidly 
growing or rapidly declining industries, but the necessity for such 
changes within 10-year periods will not include a very large 
number of commodities. As a purely practical matter, the 
choice of base-year weighting instead of given-year weighting 
gives adequate results with much less statistical calculation, as 
well as much less statistical research in seeking data for use 
as weights. 

United States Bureau of Labor Statistics. Construction of 
Index Illustrated by Practice of an Official Bureau. In the 
United States, the Bureau of Labor Statistics is one of the 
most important official compilers of index numbers of various 
kinds. From its publications can be illustrated how the various 
matters discussed above are brought into practice and how 
diligent must be the researcher, how alert the statistician, to 
new problems of weighting, sampling, and the like. 

In 1943 the Bureau of Labor Statistics of the United States 
Department of Labor was compiling and publishing weekly, 
monthly, and annual index numbers of wholesale commodity 
prices. In a revision made in 1927, when the base period was 
changed from the 1913 average to the 1926 average, a new weight¬ 
ing system was adopted; it was then decided to revise the quan¬ 
tities used as weighting factors every 2 years, as the results of 
each new biennial census of manufactures became available. 
At the same time, the number of price series was increased from 
404 to 550. Another revision was made in 1931, when the num¬ 
ber of price series was changed from 550 to 784 and some rear¬ 
rangement of the items in the groups and subgroups was made. 
No change was made in 1931 in the method of calculating the 
indexes. In December, 1942, according to the Survey of Current 
Business, the monthly index of wholesale prices compiled by the 
Bureau of Labor Statistics was made up of 889 quotations.¹ 

The weights used for farm products are based on averages 
for 3-year periods, changed every 2 years in order to keep the 
weights up to date. Thus, for the years 1932 and 1933, the 
weights used for farm prices were based upon averages of quan¬ 
tities marketed in the years 1927, 1928, and 1929; and for the 
years 1934 and 1935 the weights used for farm-products prices 
were based upon averages of quantities marketed in the years 
1929, 1930, and 1931. For all other groups of commodity prices, 
the weights used are averages of quantities produced for sale, to 

¹ Survey of Current Business, Vol. 22 (December, 1942), p. S-3. On the 
history of its compilation, weighting, etc., see Bureau of Labor Statistics, 
Bulletin 181, 415, 453, 521; “Wholesale Prices,” Serial No. R1434 
(December, 1941); “Revised Method of Calculation of the Wholesale Price Index 
of the United States Bureau of Labor Statistics,” Serial No. R.666. 
which has been added the average of imports for consumption, in 
the last two completed census periods. For example, for the 
years 1932 and 1933, the weights were based on average census 
data (plus imports for consumption) for the years 1927 and 1929, 
whereas for the years 1934 and 1935 the weights were based on 
average census data (plus imports for consumption) for the years 
1929 and 1931. In cases where census data are lacking, estimates 
are made of the quantities of the various commodities 
marketed, based on the best information available from 
governmental and reliable private sources; and these estimates are used 
as weighting factors. Commodities are added or dropped from 
time to time as they become important or cease to be important 
in the markets.¹ 

During the period of depression following 1932, when the 
data on manufactured output became violently disrupted, the 
weights based upon averages of the years 1929-1931 were 
retained. Most of the prices continued in 1943 to be weighted 
by the averages of the 1929-1931 census data, but for certain 
commodity groups new weights had begun to be based upon 
special studies of those groups. Thus, in April, 1941, the Bureau 
published a study of the “Wholesale Price Trends of Carpets and 
Rugs,” revising its price series for this group. This study 
included new weights for the prices in this group, according to 
their “importance in the country’s markets in 1939.” 

The quantity weight used for each of the series, the unit in 
which each is priced, and the 1939 value of each item expressed 
as a percentage of the aggregate value of all carpet and rug items 
in the Bureau’s indexes are shown in Table 75. 

The use of data for 1939 departed from the “general practice 
of using the 1929 and 1931 data for weighting in the Bureau’s 

¹ If a “price” index, as contrasted with a “realized price” index, is 
desired, it is necessary to keep the weights constant. In constructing a 
realized price index the weights may be changed, but in revising the weights 
the index must be calculated by using both sets of weights for the over¬ 
lapping year or period when the change is made. By “realized price” is 
meant the dollars covering the transaction, divided by the units involved in 
the transaction. When the lack of continuity of specifications makes it 
hard to define the commodity, as with automobiles, the dollars of sales for 
each general type (sedans, coupes, etc.) divided by the number of such 
units, in other words, the “realized price,” has received the endorsement 
of competent price experts as an acceptable quotation to use in price 
statistics. 
Table 75 

Price series               Unit in which priced     Weight 
Axminster carpets          Lineal yard               7,077 
Axminster 9 × 12 rugs      Each                      2,015 
Plain velvet carpets       Lineal yard               6,424 
Plain velvet carpets       Square yard               7,861 
Wilton 9 × 12 rugs         Each                        612 

Source: Bureau of Labor Statistics, mimeographed publication, “Wholesale Price Trends 
of Carpets and Rugs,” April, 1941, pp. 16-17. 

wholesale price indexes,” in order to provide weights for the 
individual items that reflected their relative importance more 
nearly in accordance with present-day sales. The Axminster 
type has long been the most popular. The relative importance 
of Wilton carpets and rugs has increased considerably since the 
depression of the early 1930’s, and they have regained much of 
their earlier popularity. Prior to the depression and before 
plain velvets became popular, the importance of Wiltons, on a 
dollar basis, was almost as great as that of Axminster carpets 
and rugs. During the depth of the depression, when consumer 
incomes were greatly reduced, there was a lessened demand for 
Wiltons, apparently because they were much more expensive, 
on the average, than Axminsters. 

The study of the carpet and rug price series is presented to 
illustrate the alertness of the Bureau in relation to the problem of 
compiling and publishing its indexes of wholesale prices. Its 
activity extends to other groups of price series as well. For 
example, beginning with January, 1938, the results of a survey 
of farm-machinery wholesale prices were incorporated for the 
first time in the Bureau of Labor Statistics general indexes of 
wholesale prices. In 1941 the Bureau began publishing weekly 
index numbers of waste and scrap materials, carrying the index 
back to January, 1939. In “Wholesale Prices” (June, 1941), 
the Bureau published a monthly index of standard machine-tool 
prices, including 11 types of standard nonspecialty machine 
tools, carrying the index back to January, 1937. These new 
indexes are calculated on August, 1939, as a base; the monthly 
index of wholesale prices continued in 1943 to be based on the 
average of 1926 as 100. 
Method of Computation Illustrated. The weighted aggregative 
method of computing an index number of prices is illustrated in 
Table 76, using only five price series. The five price series 
selected for illustration are those for carpets and rugs, for which 
the Bureau’s weights are shown in Table 75. A procedure 
similar to that illustrated in Table 76 is used by the Bureau, 
but with 889 price quotations instead of 5. 

Table 76. Work Sheet Illustrating Calculation of Weighted Aggregative Index 
Base-period weights 

                                       Average price                   Weighted price 
Commodity                            1935-1939    1941     Weights   1935-1939    1941 
                                        p_0        p_1       q_0      p_0 q_0    p_1 q_0 
(1)                                     (2)        (3)       (4)        (5)        (6) 
Axminster carpets (lineal yard)        1.567      2.014     7,077     11,090     14,253 
Axminster 9 × 12 rugs                 22.745     27.936     2,015     45,831     56,291 
Plain velvet carpets (lineal yard)     1.772      2.356     6,424     11,383     15,135 
Plain velvet carpets (square yard)     2.581      3.266     7,861     20,289     25,674 
Wilton 9 × 12 rugs                    40.007     50.521       612     24,484     30,919 
Σ                                                                    113,078    142,272 
Σ × (100/Σ p_0 q_0)                                                   100.00     125.82 

P_01 = Σ p_1 q_0 / Σ p_0 q_0 

Source: Cutts, Jesse M., and Samuel J. Dennis, “Revised Method of Calculation of 
Wholesale Price Indexes,” Journal of the American Statistical Association, Vol. 32 (1937); 
also reprinted by the Bureau of Labor Statistics as Serial No. 666. 


Accordingly, the index number of the wholesale prices of 
carpets and rugs is 100.00 for the years 1935-1939, the base 
period, and 125.82 for 1941. The latter figure is obtained by 
taking P_01 = Σ p_1 q_0 / Σ p_0 q_0. From the system of symbols already 
introduced, the symbolic presentation of this form of index 
number is as follows: 


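\[ P_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0}, \quad P_{02} = \frac{\sum p_2 q_0}{\sum p_0 q_0}, \quad \ldots, \quad P_{0n} = \frac{\sum p_n q_0}{\sum p_0 q_0} \qquad (6) \]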


This is illustrated in the figures of Tables 75 and 76, the 1941 
index being (142,272/113,078) × 100 = 125.82. 
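The work sheet of Table 76 can be reproduced with a short Python sketch. The prices and weights are those of the table; the dictionary layout, the labels distinguishing the two plain velvet series by their pricing unit, and the variable names are assumptions made for the illustration.

```python
# (average price 1935-1939, average price 1941, base-period weight) from Table 76.
series = {
    "Axminster carpets (lineal yard)":    (1.567, 2.014, 7077),
    "Axminster 9 x 12 rugs":              (22.745, 27.936, 2015),
    "Plain velvet carpets (lineal yard)": (1.772, 2.356, 6424),
    "Plain velvet carpets (square yard)": (2.581, 3.266, 7861),
    "Wilton 9 x 12 rugs":                 (40.007, 50.521, 612),
}

# Weighted aggregates: sum of p0*q0 (base period) and sum of p1*q0 (1941).
base_aggregate = sum(p0 * q0 for p0, p1, q0 in series.values())
given_aggregate = sum(p1 * q0 for p0, p1, q0 in series.values())

# The base aggregate is the constant divider for every later period.
index_1941 = 100 * given_aggregate / base_aggregate

print(round(base_aggregate), round(given_aggregate), round(index_1941, 2))
```

Rounded to whole numbers the two aggregates are 113,078 and 142,272, and the index for 1941 is 125.82, as in the table.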
Price Indexes and Quantity Indexes. Aggregative Index Using 
Given-year Weights. If given-year quantity weights are 
available and used for computing an index, it must be noted that the 
following system of ratios would merely give an index of chang¬ 
ing aggregate values, without distinguishing which part of the 
change is due to price change and which part to quantity change: 


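\[ R_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad R_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}, \quad R_{02} = \frac{\sum p_2 q_2}{\sum p_0 q_0}, \quad \ldots, \quad R_{0n} = \frac{\sum p_n q_n}{\sum p_0 q_0} \]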


Such an index is an index of aggregate value, made up partly of 
changes in quantity and partly of changes in price. In order, 
therefore, to extract from it that part of the change which is due 
solely to price change, the base-year prices must be multiplied 
throughout by the given-year weights. This fact makes the 
given-year weighting method a very long one to calculate; it loses 
the advantage, inherent in the aggregative index weighted by 
base-year quantities, of having a constant divider in securing the 
index. In addition, the method of weighting by given-year 
quantities necessitates two sets of cross products for each 
year: each year’s quantities multiplied by that year’s prices 
and by the base-year prices. Following is the symbolic 
expression of the aggregative index of prices weighted by 
given-year quantities: 


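\[ P_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1}, \quad P_{02} = \frac{\sum p_2 q_2}{\sum p_0 q_2}, \quad \ldots, \quad P_{0n} = \frac{\sum p_n q_n}{\sum p_0 q_n} \qquad (7) \]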
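The extra labor of the given-year method may be seen in a short Python sketch of the ratio Σ p_n q_n / Σ p_0 q_n. No given-year quantities are supplied in the text, so the two commodities "a" and "b" and all their prices and quantities below are hypothetical figures invented for the illustration.

```python
# Hypothetical prices and quantities for periods 0 (base), 1, and 2.
prices = {
    0: {"a": 2.0, "b": 5.0},
    1: {"a": 3.0, "b": 4.0},
    2: {"a": 3.5, "b": 4.5},
}
quantities = {
    0: {"a": 10, "b": 4},
    1: {"a": 12, "b": 5},
    2: {"a": 11, "b": 6},
}


def aggregate(p, q):
    """Sum of price times quantity over all commodities."""
    return sum(p[c] * q[c] for c in p)


def given_year_weighted_index(n):
    """100 * sum(p_n * q_n) / sum(p_0 * q_n); the divider changes with n."""
    return 100 * aggregate(prices[n], quantities[n]) / aggregate(prices[0], quantities[n])


index = {n: given_year_weighted_index(n) for n in prices}
dividers = {n: aggregate(prices[0], quantities[n]) for n in prices}
print(index)
print(dividers)
```

Each period requires its own divider, whereas the base-year method uses Σ p_0 q_0 throughout.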


Index of Quantities Weighted by Prices. An advantage of the 
given-year weighting method is that an index of quantities 
weighted by prices can be obtained as a by-product, with 
comparatively little additional calculation. The same numerators as 
those used in Eq. (7) can be used to calculate an index of 
quantity change weighted by given-year prices. For each year, 
given-year prices are multiplied by base-year quantities, and 
these aggregates are used as dividers. This will give an index of 
quantity weighted by given-year prices, as follows: 
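\[ Q_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \quad Q_{01} = \frac{\sum p_1 q_1}{\sum p_1 q_0}, \quad Q_{02} = \frac{\sum p_2 q_2}{\sum p_2 q_0}, \quad \ldots, \quad Q_{0n} = \frac{\sum p_n q_n}{\sum p_n q_0} \qquad (8) \]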


Unfortunately, this advantage in the given-year weighting 
method is largely imaginary because the quantity data are not 
available soon enough for short periods of time to make it prac¬ 
ticable to construct monthly or weekly indexes. In any case, it is 
also possible to obtain a quantity index weighted by base-year 
prices, using the following equation, which would provide a 
much simpler method: 


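\[ Q_{00} = \frac{\sum p_0 q_0}{\sum p_0 q_0}, \qquad Q_{01} = \frac{\sum p_0 q_1}{\sum p_0 q_0}, \qquad Q_{02} = \frac{\sum p_0 q_2}{\sum p_0 q_0} \qquad (9) \]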
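How the value ratio divides into a price part and a quantity part can be checked numerically. The Python sketch below uses hypothetical prices and quantities for two commodities (none are given in the text) and illustrative names. It forms the value ratio Σ p_1 q_1 / Σ p_0 q_0, the two price indexes (base-year and given-year quantity weights), and the two quantity indexes (given-year and base-year price weights), and confirms that each price index multiplied by the opposite quantity index reproduces the value ratio.

```python
# Hypothetical data for a base period (0) and a given period (1).
p0 = {"a": 2.0, "b": 5.0}
p1 = {"a": 3.0, "b": 4.0}
q0 = {"a": 10, "b": 4}
q1 = {"a": 12, "b": 5}


def aggregate(prices, quantities):
    """Sum of price times quantity over all commodities."""
    return sum(prices[c] * quantities[c] for c in prices)


value_ratio = aggregate(p1, q1) / aggregate(p0, q0)

price_base_weights = aggregate(p1, q0) / aggregate(p0, q0)      # form of Eq. (6)
price_given_weights = aggregate(p1, q1) / aggregate(p0, q1)     # form of Eq. (7)
quantity_given_prices = aggregate(p1, q1) / aggregate(p1, q0)   # form of Eq. (8)
quantity_base_prices = aggregate(p0, q1) / aggregate(p0, q0)    # form of Eq. (9)

# Each pairing of a price index with the opposite quantity index gives the value ratio.
product_a = price_base_weights * quantity_given_prices
product_b = price_given_weights * quantity_base_prices
print(value_ratio, product_a, product_b)
```

With these figures the value ratio is 1.40, made up of a price rise of 15 per cent and a quantity rise of about 21.7 per cent on the first pairing, or about 14.3 and 22.5 per cent on the second.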


Quantity indexes are constructed, however, by other methods, 
usually with more general application of stratified sampling and 
with other weights than prices, largely because of the difficulty 
of obtaining quantity data. Not only are these other methods 
more convenient to calculate, but they make it possible to 
handle matters having to do with weighting and bias in the 
results. In using equations like Eqs. (8) and (9), it is often very 
difficult to appraise the inaccuracies due to bias inherent in the 
method. 

Quantity Indexes and Business Barometers. Indexes of 
Quantity of Trade or Production. Several statisticians and 
economists made attempts, especially in the years immediately 
following the First World War, to construct an index that would 
trace variations in the physical volume of production or trade. 
Pioneer efforts to construct such indexes, based upon scant 
material and with little in the way of a statistical theory to 
guide them, were made before the First World War by Wesley C. 
Mitchell, Irving Fisher, and Edwin W. Kemmerer. During the 
war and postwar period important progress was made, especially 
by Edmund E. Day, Warren M. Persons, and others. In 1923, 
the latter published an index of trade for the United States, 
beginning with the year 1903.¹ The index of production is 
based very heavily on the index of employment; it might there¬ 
fore fail to reflect properly the results of technological advance. 


¹ “An Index of Trade for the United States,” Review of Economic Statistics, 
Preliminary Vol. 5 (April, 1923), pp. 71-78. Cf. also Garfield, Frank R., 
“General Indexes of Business Activity,” Federal Reserve Bulletin, Vol. 26 
(1940), pp. 495-501. 
After these experiments by pioneering individuals, several gov¬ 
ernment agencies and privately financed research organizations 
took up the task of developing indexes of trade and production. 
The most widely known and now most currently used index of 
industrial production for the United States is that compiled and 
regularly published in the Federal Reserve Bulletin by the Research 
Division of the Board of Governors of the Federal Reserve Sys¬ 
tem. This index is compiled from 95 individual series of monthly 
data, representing about 85 per cent of the total industrial pro¬ 
duction of the United States. The series include 22 durable- 
goods manufacturing industry series, 63 nondurable-goods 
manufacturing industry series, and series representing production 
of fuels and metals. This index is also regularly reproduced in 
the Survey of Current Business, published by the United States 
Department of Commerce.¹ A reproduction of the entire index, 
with its component parts, 1923-1940 by months, with the 
average 1935-1939 = 100 as the base, can be found in the Federal 
Reserve Bulletin.² 

It is characteristic of the indexes of physical volume of produc¬ 
tion or trade that they consist of combinations of various series 
upon the basis of stratified sampling, the weights for the repre¬ 
sentative series being devised upon a priori knowledge concern¬ 
ing the importance of certain groups of activity in relation to the 
whole of business activity. These indexes treat the separate 
series statistically before putting them together. For example, 
they remove seasonal variation and trend from the separate 
series and thus average together the cycles of the various separate 
series into the composite. The method of averaging employed 
is generally the aggregative method, although since 1940 the 
Federal Reserve Board uses an average of relatives weighted by 
quantities so that the final result is equivalent to what would be 
obtained by using the weighted aggregative method.³ 

¹ Federal Reserve Bulletin, Vol. 13 (February, March, 1927), Vol. 17 
(February, September, 1931), Vol. 18 (March, 1932); for adjustments made 
necessary by the 1942 world war, see Vol. 27 (1941), pp. 878-881; cf. Survey 
of Current Business, Vol. 20 (1940), pp. 11-17. 

² Vol. 26 (1940), pp. 825-882; see also “Answer to Critics of the Index,” 
pp. 1047-1049. See also Woodlief, Thomas, and Maxwell R. Conklin, 
“Measurement of Production,” Federal Reserve Bulletin, Vol. 26 (1940), 
pp. 912-924. 

³ See Woodlief and Conklin, op. cit. 
Computation of Weights Illustrated. The index of 
manufactures published by the Federal Reserve Board is weighted by the 
total value added by manufacture in the case of all 
manufacturing industries, and the index of mineral production is weighted 
by value of mineral products. The sum of these two is the 
index of production. The individual production series of which 
the manufacturing index is composed are weighted, as nearly as 
possible, according to the same principle.¹ Accordingly, the 
total value added by manufacturing industries in 1937, as 
reported by the United States census, was distributed among the 
16 groups represented in proportion to the value added for each 


Table 77. Relative Importance of Industry Groups and Selected 
Industries Included in the Federal Reserve Board Index 
of Industrial Production 
(Per cent of total with 1937 weights) 

Series                                           1937 Weights 
Industrial production                               100.00 
  Manufactures                                       84.80 
    Durable manufactures                             37.93 
      Iron and steel                                 11.00 
      Machinery production                           10.81 
      Transportation equipment                        5.92 
      Nonferrous metals and their products            2.81 
      Lumber and its products                         4.39 
      Stone, clay, and glass products                 3.00 
    Nondurable manufactures                          46.87 
      Textiles and their products                    11.22 
      Leather and its products                        2.28 
      Manufactured food products                     10.92 
      Alcoholic beverages                             1.84 
      Tobacco products                                1.24 
      Paper and its products                          3.13 
      Printing and publishing                         6.44 
      Petroleum and coal products                     2.14 
      Products of chemicals                           6.27 
      Rubber products                                 1.39 
  Minerals                                           15.20 
      Fuels                                          13.01 
      Metals                                          2.19 

Source: Condensed from Federal Reserve Bulletin, Vol. 26 (1940), p. 919. 


¹ For industry series in which census data on value added by manufacture 
are not available other criteria had to be used, such as total value of 
manufactured product, raw materials consumed, or man-hours worked. See ibid., 
pp. 917-918. 
group, and then derived group totals were subdivided among 
industries and finally among individual products in a similar 
manner. Each individual series is thus assigned a hypothetical 
value-added figure, which is then divided by the relative of the 
1935-1939 compared with 1937 in order to convert 1937 
value-added figures to 1935-1939 base. The derived 1935-1939 figures 
for each series are then expressed as percentages of their own 
total to obtain the weights. These percentages represent the 
estimated relative importance of each series in the 1935-1939 
base period and are the weights applied to the relatives in 
combining them into the index of production. Table 77 repro¬ 
duces a summary of these weights. 

Using the weights shown in Table 77, the equation for the 
Federal Reserve Board’s index of industrial production is as 
follows: 

\[ Q_{0n} = \sum w \frac{q_n}{q_0} \]

in which 

\[ w = \frac{p_{37} q_0}{\sum p_{37} q_0} \]

and p_37 represents the value (or value added) per unit of output in 
the weight-base period. 
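That an average of quantity relatives weighted in this way is equivalent to a weighted aggregative index can be verified with a short Python sketch. The three series, their unit values, and their outputs are hypothetical figures invented for the illustration, and the reading of the weight as w = p_37 q_0 / Σ p_37 q_0 is taken as an assumption from the description above.

```python
# Hypothetical unit values in the weight-base year (p37) and outputs in the
# comparison-base period (q0) and in a given month (q1).
p37 = {"steel": 40.0, "textiles": 2.0, "coal": 5.0}
q0 = {"steel": 10, "textiles": 100, "coal": 40}
q1 = {"steel": 12, "textiles": 110, "coal": 38}

# Weights: share of each series in the total value p37 * q0.
total_value = sum(p37[s] * q0[s] for s in p37)
w = {s: p37[s] * q0[s] / total_value for s in p37}

# Average of quantity relatives, weighted by w.
average_of_relatives = 100 * sum(w[s] * q1[s] / q0[s] for s in p37)

# Weighted aggregative form: sum(p37 * q1) / sum(p37 * q0).
aggregative = 100 * sum(p37[s] * q1[s] for s in p37) / total_value

print(w)
print(round(average_of_relatives, 2), round(aggregative, 2))
```

The two results coincide because q_0 cancels between each weight and its relative, leaving Σ p_37 q_1 / Σ p_37 q_0.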

Barometers, or Indexes, of General Business Conditions. Some 
composite indexes purport to be barometers or indexes of business 
and trade in general. These indexes are of two types: (1) A 
single series is sometimes believed to be a barometer of general 
business conditions and (2) a number of indexes of trade activity 
are combined in order to measure general business conditions. 

Of the first type, the most prominent one at present is prob¬ 
ably the index of electrical-power production, which is compiled 
from quantity figures published by the Geological Survey. The 
index of activity in the steel industry at one time was looked 
upon as a good barometer of general business conditions because 
so many industries are dependent upon steel or steel products. 
The trends in the average of security market prices are sometimes 
taken as a barometer of coming business conditions, or at least 
as a measure of existing conditions. In wartime, the security 
markets often reflect conditions and war information that are 
not generally publicized. 
A good example of the second type of index of general business 
conditions is that published currently by the New York Times and 
formerly by the Annalist. This index has been widely used 
and until October, 1937, was reproduced in the Survey of Current 
Business, published by the United States Department of 
Commerce. 

A more comprehensive example of this second type of index 
of business conditions is one that has evolved from the pioneer 
work of Carl Snyder, whose procedure was based upon the 
theory that the fluctuations in total bank clearings are made up 
of two variables: (1) price change and (2) change in physical 
volume of trade. By constructing a price deflator, which has 
already been discussed, and then by using this deflator to cancel 
out from aggregate bank clearings that part due to price changes, 
he sought to obtain an index of physical volume of trade for 
the years 1875-1924.¹ Modifications and refinements were made 
in the construction of this index by Leroy M. Piser, so that it 
was known as the Snyder-Piser index of volume of trade for the 
United States. It included 89 series, classified as follows: pro¬ 
ductive activity, 46 series; primary distribution, 13 series; 
distribution to consumer, 8 series; financial activity, 6 series; 
general (such as life insurance, postal receipts, electrical-power 
corporations, farmers, and communication), 5 series; and finally 
debits outside New York City. Thus it came to be based upon 
the principle of stratified sampling. This index of volume of 
trade and production is published monthly by the Federal 
Reserve Bank of New York in its Monthly Review of Credit and 
Business Conditions.² 
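The deflating step on which Snyder's theory rests can be illustrated in Python. This is only a sketch of the general idea of dividing a value series by a price index; it is not a reconstruction of Snyder's actual procedure, and the years and index values below are hypothetical.

```python
# Hypothetical indexes (base = 100) of bank clearings and of the price deflator.
clearings_index = {1920: 150.0, 1921: 120.0, 1922: 132.0}
price_deflator = {1920: 125.0, 1921: 100.0, 1922: 96.0}


def deflate(value_index, price_index):
    """Divide a value index by a price index to leave the physical-volume part."""
    return {year: 100 * value_index[year] / price_index[year] for year in value_index}


volume_index = deflate(clearings_index, price_deflator)
print(volume_index)
```

In the hypothetical figures the fall in clearings from 1920 to 1921 is wholly a price change, so the deflated volume index stands at 120 in both years.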

Various forecasting services compile their respective indexes of 
business conditions according to their particular interpretation 
as to what should best be included in such an index and how 
best to weight various factors. Carefully worked out indexes 
of the marketing of farm products and forestry products are now 
available as a result of the efforts of the Bureau of Agricultural 
Economics in the United States Department of Agriculture. 
These are reproduced in the Survey of Current Business. The 

¹ Snyder, Carl, “A New Clearings Index of Business for Fifty Years,” 
Journal of the American Statistical Association, Vol. 19 (1924), pp. 329-335. 

² Johnson, Norris O., “New Indexes of Production and Trade,” Journal 
of the American Statistical Association, Vol. 33 (1938), pp. 341-348. 
Bureau of Foreign and Domestic Commerce also publishes 
indexes of domestic commodity stocks, and of world stocks of 
certain outstanding industries, which constitute good barometers 
of the related business conditions.¹ 

ADJUSTMENT OF INDEXES TO BENCH MARKS 

Ideal Conditions for Stratified Sampling Nonexistent. In order 
to produce results approximating those of true random sampling, 
conditions favorable to random sampling in the subgroups or 
strata must exist. For one thing, this means that there must 
be large numbers of items from which to draw within each sub¬ 
group. It also means that the number of sample items must be 
sufficient to avoid the disadvantages of small samples. The law 
of large numbers must be given opportunity to produce the 
results of true random selection within each stratum; such is the 
sine qua non of truly successful stratified sampling. Under such 
conditions the method of selection causes no accumulation of 
bias. Such ideal conditions do not exist with reference to any 
known index number, not even the Bureau of Labor Statistics 
index of wholesale prices, which contains a total of more items 
than any other index. 

Nevertheless, the pattern of stratified sampling can with 
considerable advantage be adopted as the guide to procedure in 
the construction of indexes of all types. Following this pattern 
the investigator first works out a system of classification of the 
data for which an index is to be compiled. Using the subgroups 
of this classification, he can then proceed according to the prin¬ 
ciples of stratified random sampling so far as it is possible to do 
so. When he finds that conditions ideal for random sampling in a 
subgroup fail to exist, the investigator must resort to subjective 
means to secure results that he believes will be representative. 

Inasmuch as all indexes contain, in some part, data that have 
been collected and processed by the use of such subjective 
methods, employed in the absence of ideal sampling conditions, 
it is desirable wherever possible for the statistician to find bench 
marks with which he can compare the results of his sampling 

¹ Survey of Current Business, Vol. 20, Annual Supplement (1940), pp.
83-164. For a more complete discussion of barometers of general business
conditions see Wesley C. Mitchell, Business Cycles: The Problem and Its
Setting (1928), pp. 291-330; Joseph L. Snider, Business Statistics; and
Garfield, op. cit.
procedure. A common-sense appraisal of the over-all result is
the most generally used bench mark to judge whether or not the 
results are satisfactory; but this method presupposes an unusual 
amount of a priori knowledge and of scientific critical judgment 
on the part of the statistician. Sometimes more objective bench 
marks may be found to aid the statistical worker along his
thorny path. These will be illustrated in the ensuing sections. 

Reasons Why Indexes Require Adjustment. The reasons why 
indexes require adjustment to bench marks do not necessarily 
arise from faulty application of the method of stratified sampling. 
They arise from the nature of the universe from which the sample 
is taken. In connection with most types of data collected for 
the construction of indexes, the universe is a discrete rather than 
a continuous one; in other words, the universe consists of com¬ 
paratively small numbers of units, each of which constitutes a 
comparatively large proportion of the whole universe. Often 
they cannot be considered as representative of each other. 

When such a comparatively small universe is subdivided, in 
order to apply the stratified sampling technique, the strata constitute
universes with still smaller numbers. Added to this is
the usual fact that only a portion of this remaining small num¬ 
ber is accessible to the data collector; in some cases, unavoidable 
bias itself constitutes a part of the reason for the accessible por¬ 
tion. Under such circumstances it is almost impossible to 
realize the essential condition of randomness of selection in the 
respective strata, and consequently stratified sampling technique 
gives less satisfactory results. 

Such is the situation with respect to sample data collected
from business firms, especially manufacturing enterprises. In 
some of the subdivisions, corporate enterprise is on so large a 
scale that only a few firms represent a large portion of that 
stratum. In all subdivisions, the size of the sample return 
measures, not only changes in the trends it is desired to measure, 
but also success or failure of the collecting agency in persuading 
firms to report. The statistical technique of comparing iden¬ 
tical firms from month to month reduces but does not altogether 
obviate the cumulative error resulting from this weakness. 

In addition, growth of an industry, and hence growth in pay
rolls, in output, in stocks of materials, or in whatever is the
subject of investigation, occurs not only in existing firms; but part
of the growth is in the rise of new firms in the industry. In some
strata, perhaps the steel and machinery industries, expansion or 
contraction of existing firms (and hence those reporting in a 
stratified sample) may accurately reflect proportionately a rise 
or fall in business in that stratum. But in other strata, like some
branches of the textile industry or the food industry, the expan¬ 
sion or contraction of existing firms (and hence those reporting 
in a stratified sample) may not at all reflect proportionately the 
rise or fall in business. 

In heavy industry, where plant and equipment constitute a 
large proportion of the business investment, cyclical changes of 
the sample might well be much greater than cyclical changes in 
the universe. This would follow if the large firm, with heavy 
investment in plant and equipment, tended to curtail production 
instead of lowering prices when faced with declining business 
prospects. 

In some branches of the clothing and food industries, in which 
small investment in plant and equipment and large numbers 
of small firms predominate, cyclical changes may be smaller in 
the sample than in the universe. The birth of new firms or 
resumption of activity by old firms is the principal manner of 
expansion in such strata. The death of old firms is the principal 
means of decline. The reporting firms are quite likely to be the 
ones that would not die at a rate so rapid as the average rate in 
the industry. 

Circumstances like those just described constitute only an 
illustration of the type of problem facing the statistician, who 
must continually endeavor to improve the sample of reporting 
firms. Great as his efforts and ingenious as his imagination
may be, the resulting sample is likely to show bias.

For bench marks in connection with adjustment of indexes 
of employment and pay rolls, statisticians have made use of the 
successive issues of the census of manufactures, often called the 
Biennial Census of Manufactures, which appeared in 1914, 1919, 
and each odd year thereafter, including 1939. For years after 
1939 it should be possible to get similar bench-mark data from 
the records of the Social Security Administration. 

By using the census-of-manufactures data as bench marks, it 
has been possible to check up on the monthly or weekly sample 
results obtained by the sampling process and to adjust them for 
any bias that is disclosed by such a check. As each new set of
manufacturing census data became available, which was about 
two years from the time it was taken, such indexes could be 
adjusted to the census data for errors that had accumulated since 
the last census. In the meantime, the results of the sample were 
relied upon as the best available information; and at the same 
time subjective sampling procedures were continually studied 
with a view to improvement wherever possible. To this end, 
the adjustment procedure often discloses areas in which the 
sampling results are especially in need of improvement. 

The method of making such adjustment will be illustrated, 
not so much as a valuable statistical device in itself, but as an 
example to the student of the care and attention to form, pro¬ 
cedure, cross checking, and the like, required of a good statis¬ 
tician. Thus the following instructions, together with the form 
used, are presented as an exhibit to help the student visualize 
how a statistician plans his work and works his plan. 

Method of Adjustment Illustrated. A good example of an index 
adjusted to United States census bench marks is the monthly 
index of pay rolls and employment published by the Bureau of 
Labor Statistics. The method of adjustment is reproduced by 
permission of the Bureau of Labor Statistics and is applied to a 
monthly index of pay rolls in the metal stamping, enameling, 
and japanning and lacquering industry of New Jersey.¹ The
adjustment is carried out on Form BLS 1238, June, 1940, pre¬ 
sented here in Table 78. 

The raw data, which have been adjusted for the 1937 and 
earlier census figures or bench marks, but remain to be adjusted 
for 1939 census data, are entered by months in columns (3), 
(9), and (13); the sums and averages (S and I) for each of these
columns are then entered. In column (17), using the lower part
entitled “Formula if L is not available,”² enter the United
States census figures for 1937 and 1939 ($Z_1$ and $Z_3$). Calculate the
ratio $Z_3/Z_1$, and enter it in the space provided, that is, 0.933280

¹ The work on New Jersey data was done by a Work Progress Administration
project sponsored by the New Jersey State Labor Department for the
construction of monthly indexes of pay rolls and employment in manufac¬ 
turing industries, January, 1923-December, 1940. One of the authors was 
called upon to serve as consultant and director of the project. 

2 The part labeled “Formula if L is available” is used with a blanket
adjustment method involving several census periods. 
in Table 78. Copy $S_1$ from column (3) in the space provided
in column (17), that is, in Table 78, 883.82; this number multiplied
by the ratio $Z_3/Z_1$ equals $S'_3$, entered in the space provided,
in Table 78, that is, 824.85.

In column (13) calculate $R_3$ by finding $\sum nI$ for the year 1939
[the n for each month is found in column (12)]:

$$\sum nI = \text{January } nI + \text{February } nI + \text{March } nI + \cdots + \text{December } nI,$$

including an $nI$ for each of the 12 months.

In column (18) enter $S'_3$, copying it from the last row of
column (17). Enter $S_3$, copying it from column (13). Subtract
$S_3$ from $S'_3$ and enter the difference in the next row of column (18).
Copy $R_3$ [from column (13)] in the next row of column (18).
This value, $R_3$, divided into the figure in the preceding row,
$S'_3 - S_3$, gives the value of $d$. Enter $d$ in the last row of column
(18). This is the adjustment parameter. It is now used to
adjust the series by months as follows:

In columns (4), (10), and (14) enter 1 + nd for each month. 
These values should be obtained on a calculating machine as 
follows: Put 1.000000 in the machine, and add it. Put d on the 
keyboard, being careful to place it correctly for the decimal 
point. Subtract once, and record the answer for 1 + nd in 
January, 1937. Subtract twice more (making -3 altogether),
and record the answer for 1 + nd in February, 1937. Subtract
twice more (making -5 altogether), and record the answer for
1 + nd in March, 1937. Subtract once more (making -6
altogether), and record the answer for 1 + nd in April, 1937.
The values for 1 + nd in May, June, July, and August are the
same as 1 + nd for April, March, February, and January, 
respectively. They can be found by reversing the above process 
on the calculator until 1.000000 remains in the machine and d 
on the keyboard. For September add d once and record 1 + nd 
for that month. Add d four more times (making 5 altogether), 
and record 1 + nd for October. By following a similar procedure,
guided by n in columns (2), (8), and (12), values of
1 + nd are calculated and entered for each month through 
December, 1939. 
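
The first adjustment described above can be sketched in a few lines of Python. The figures 883.82 and 0.933280 are taken from the text; the twelve monthly indexes and the multipliers n for 1939 are hypothetical stand-ins for columns (13) and (12) of Table 78, which is not reproduced here, and the function and variable names are assumptions made for this illustration only.

```python
def first_adjustment(indexes, n, target_sum):
    """Return d and the adjusted monthly indexes I * (1 + n*d).

    d is chosen so that the adjusted indexes add up to target_sum:
        d = (target_sum - sum of I) / (sum of n*I)
    """
    raw_sum = sum(indexes)                                # S3 in the text
    r = sum(ni * i for ni, i in zip(n, indexes))          # R3 = sum of n*I
    d = (target_sum - raw_sum) / r                        # adjustment parameter
    adjusted = [i * (1 + ni * d) for ni, i in zip(n, indexes)]
    return d, adjusted


# From the text: S1 = 883.82 and Z3/Z1 = 0.933280 give the target for 1939.
s1 = 883.82
ratio = 0.933280
target = s1 * ratio                                       # about 824.85

# Hypothetical 1939 indexes and multipliers n (NOT the Table 78 figures).
indexes_1939 = [66.0, 67.5, 68.0, 68.5, 69.0, 69.5,
                68.0, 70.0, 71.5, 72.0, 72.5, 73.0]
n_1939 = list(range(13, 25))

d, adjusted = first_adjustment(indexes_1939, n_1939, target)
print(round(target, 2), round(sum(adjusted), 2))
```

Because each adjusted figure is I + d(nI), the adjusted column adds up to the raw sum plus d times the sum of nI, which is exactly the census-based target; this is why d is found by dividing the difference of the two sums by the sum of nI.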

Enter in columns (5), (11), and (15) the indexes in columns 
(3), (9), and (13), multiplied, respectively, by the 1 + nd for the 
corresponding month in columns (4), (10), and (14). Add for 
each year, and enter sums, which equal $K$, $S'_2$, and $S'_3$. Divide



Table 78. —Adjustment of Monthly Indexes to Biennial Census Figures 

Continuation method 

U.S. Bureau of Labor Statistics Form B.L.S. 1238 June, 1940 
Statistics, Bulletins 610, (October, 1934), 3937 (November, 1936), 4189 (January, 1937), and 4382 (April, 1937), on revised indexes of factory employment.
The revised indexes appear currently in the monthly bulletin on “Employment and Pay Rolls,” Serial No. R589, and the back figures are published in 
full from 1919 to date in Federal Reserve Bulletin, Vol. 24 (1938), pp. 838-866. Cf. also recommendations of the Committee on Government Statistics 
in “Recent Progress in Employment Statistics” by Aryness Joy, Journal of the American Statistical Association, Vol. 29 (1934), pp. 355-371. 
each of these by 12, and enter the quotients in the next row.
Thus, in Table 78, $K = 883.88$, $S'_2 = 654.19$, and $S'_3 = 824.84$;
and divided by 12 these become 73.66, 54.52, and 68.74.

In column (16) enter $S_1$ and $K$, as indicated in the table;
subtract $K$ from $S_1$, and enter the difference. Divide this difference
by 40, and enter the quotient in the next row. This
figure is $h$, the parameter for the second adjustment. If $S_1 - K$
is smaller than 0.05, regardless of sign, do not calculate $h$. In
the problem illustrated, $S_1 - K = -0.06$; hence $h$ is calculated
and found to be $-0.002$.

If $h$ is calculated, enter in column (6), for each month, $mh$;
that is, in January enter $h$, in February enter $2h$, in March enter
$3h$, in April enter $4h$, in May to August, inclusive, $5h$; thereafter
declining each month, with $2h$ in November and $h$ in December.
Enter in column (7) the sum of the figures for the respective
months in columns (5) and (6). The sum of column (7) is equal
to $S_1$. If $h$ is not used, the sum of column (5) is taken as $S_1$; that
is, if $h$ is ignored, $K = S_1$.
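
The second adjustment can be sketched in the same way. The monthly weights 1, 2, 3, 4, 5, 5, 5, 5, 4, 3, 2, 1 follow the description above and add up to 40, which is why the difference is divided by 40. The sums 883.82 and 883.88 are the figures quoted from Table 78; the twelve monthly values of column (5) are hypothetical numbers chosen only so that they add up to 883.88, and the function name is an assumption.

```python
# Weights m for January through December, as described in the text.
M = [1, 2, 3, 4, 5, 5, 5, 5, 4, 3, 2, 1]


def second_adjustment(col5, s1, tolerance=0.05):
    """Return h and column (7) = column (5) + m*h.

    If the difference S1 - K is smaller than the tolerance, regardless
    of sign, h is not calculated and column (5) is returned unchanged.
    """
    k = sum(col5)                      # K, the sum of column (5)
    diff = s1 - k
    if abs(diff) < tolerance:
        return 0.0, list(col5)
    h = diff / sum(M)                  # divide the difference by 40
    col7 = [v + m * h for v, m in zip(col5, M)]
    return h, col7


# Hypothetical column (5) values whose sum is K = 883.88 (NOT Table 78).
col5 = [72.10, 72.85, 73.40, 73.95, 74.20, 74.60,
        73.90, 74.10, 73.80, 73.75, 73.65, 73.58]
s1 = 883.82

h, col7 = second_adjustment(col5, s1)
print(h, sum(col7))
```

With these figures the difference is -0.06 and the quotient is -0.0015, which corresponds to the -0.002 quoted in the text once it is rounded to three decimals. Since the weights add up to 40, adding mh to every month moves the annual sum by exactly the difference, so column (7) adds up to the original sum.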



CHAPTER XX 
Elements of Variation in Time Series. The elements of
variation contained in an ordinary time series may be illustrated 
by building up a hypothetical time series. 

The first element in the time series is long-time growth, or
trend. People living in the twentieth century are accustomed
to the idea that things grow, or progress. Table 79, column (1),
shows years and months for 3 years, and column (2) shows a set
of figures that grow at the constant difference of 0.2 per month.
This column of figures is plotted in Fig. 135 (AA') and is a
picture of the growth, or trend, in the hypothetical time series.

Time series are also likely to have seasonal variations. Many
economic and social phenomena vary from season to season in a
similar manner each year. This is most evident in the case of
activities affected by weather, such as agricultural production;
but such patterns of seasonal variation occur in other events as
well. Suppose the seasonal variation in the hypothetical time
series is such that November is usually 58 per cent above the
average month, July is usually only 43 per cent as large as the
average month, etc., as indicated in Table 80 showing the index
of seasonal variation for the hypothetical time series.

Fig. 135.—Two of the component parts of a hypothetical time
series. AA' = assumed trend. BB' = assumed trend, modified by
seasonal variation.

Figure 136 is a graph of this seasonal variation as it occurs year 
after year, 1943, 1944, and 1945. 
Table 79.—Hypothetical Time Series Built Up

      (1)            (2)         (3)          (4)          (5)
 Year and month   Growth or   Seasonal     The cycle    All three
                  trend       variation    (per cent)   elements
                              put in                    combined

 1943:
   January          1.0         1.45         100          1.45
   February         1.2         1.49         101          1.50
   March            1.4         1.41         102          1.44
   April            1.6         1.33         103          1.37
   May              1.8         1.19         106          1.26
   June             2.0         1.04         109          1.13
   July             2.2         0.95         112          1.06
   August           2.4         1.01         115          1.16
   September        2.6         2.42         120          2.90
   October          2.8         4.00         122          4.88
   November         3.0         4.74         124          5.88
   December         3.2         5.02         125          6.28
 1944:
   January          3.4         4.93         126          6.21
   February         3.6         4.46         130          5.80
   March            3.8         3.84         140          5.38
   April            4.0         3.32         150          4.98
   May              4.2         2.77         160          4.43
   June             4.4         2.29         180          4.12
   July             4.6         1.98         200          3.96
   August           4.8         2.02         210          4.24
   September        5.0         4.65         160          7.44
   October          5.2         7.44         140         10.42
   November         5.4         8.53         112          9.55
   December         5.6         8.79         111          9.76
 1945:
   January          5.8         8.41         109          9.17
   February         6.0         7.44         107          7.96
   March            6.2         6.26         105          6.57
   April            6.4         5.31         103          5.47
   May              6.6         4.36         101          4.40
   June             6.8         3.54          99          3.50
   July             7.0         3.01          90          2.71
   August           7.2         3.02          85          2.57
   September        7.4         6.88          75          5.16
   October          7.6        10.87          65          7.07
   November         7.8        12.32          60          7.39
   December         8.0        12.56          55          6.91
Table 80.—Seasonal Variation
(In percentages of the average month)

 January     145      May       66      September     93
 February    124      June      52      October      143
 March       101      July      43      November     158
 April        83      August    42      December     157







When the seasonal variation and trend are combined, a line 
like BB' in Fig. 135 is produced; the data are shown in column 
(3) of Table 79. To obtain each monthly value for the line 
BB', each monthly coordinate of the line AA', that is, the growth
element in the time series, has been 
multiplied by the index of seasonal vari¬ 
ation for the corresponding month. This 
has the effect of redistributing the total of 
the 12 monthly figures of the growth line 
in such a manner as to make them prop¬ 
erly reflect the seasonal element. Thus 
the trend figure for January is multiplied 
by 1.45 (or 145 per cent) while the April 
trend figure is multiplied by 0.83 (or 83 
per cent). 
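
The multiplication just described can be sketched as follows. The seasonal index is copied from Table 80 and the trend starts at 1.0 and rises by 0.2 per month, as stated for column (2) of Table 79; the function names are assumptions made for this illustration.

```python
# Index of seasonal variation from Table 80, January through December,
# in percentages of the average month.
SEASONAL = [145, 124, 101, 83, 66, 52, 43, 42, 93, 143, 158, 157]


def trend(month_number):
    """Growth element: 1.0 in January 1943 (month 0), rising 0.2 per month."""
    return 1.0 + 0.2 * month_number


def trend_with_seasonal(n_months=36):
    """Column (3): each trend figure times the seasonal index of its month."""
    return [trend(i) * SEASONAL[i % 12] / 100 for i in range(n_months)]


col3 = trend_with_seasonal()
print([round(v, 2) for v in col3[:12]])   # the twelve months of 1943
```

Rounded to two decimals, the January and April 1943 figures come out as 1.45 and 1.33, in agreement with column (3) of Table 79.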

A third element of variation in time 
series is cyclical fluctuation, which may 
extend over several years. For example, 

Fig. 137 shows the rising and the falling 
movement of a cyclical fluctuation by 
months that occurs over a period of 3 
years; this is shown also in column (4) of 
Table 79. In column (5) of Table 79 and 
in Fig. 138 is shown the effect of combining also the cyclical
movement. The figures for the respective months are now 
altered according to whether the cycle is carrying them upward 
or downward, and the percentage figures for the cycle, shown in 
column (4), depict this upward and downward swing of the cycle. 
The cycle is put into the data by multiplying each monthly
figure in column (3) by the corresponding monthly index of the 
cycle found in the same row of column (4). The results are 
shown in Fig. 138, which is the final hypothetical time series; 
the data for it are in column (5) of Table 79. 
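
As a check on the procedure, the 1943 rows of Table 79 can be reproduced by multiplying column (3) by the cycle percentages of column (4). The figures below are copied from the table; the January cycle value of 100 is an assumption inferred from the fact that columns (3) and (5) agree for that month, and the function name is hypothetical.

```python
# 1943 rows of Table 79.
COLUMN_3 = [1.45, 1.49, 1.41, 1.33, 1.19, 1.04,
            0.95, 1.01, 2.42, 4.00, 4.74, 5.02]   # trend with seasonal put in
CYCLE = [100, 101, 102, 103, 106, 109,
         112, 115, 120, 122, 124, 125]            # per cent; January assumed
COLUMN_5 = [1.45, 1.50, 1.44, 1.37, 1.26, 1.13,
            1.06, 1.16, 2.90, 4.88, 5.88, 6.28]   # published final series


def put_cycle_in(series, cycle_percent):
    """Multiply each monthly figure by the cycle index of the same month."""
    return [v * c / 100 for v, c in zip(series, cycle_percent)]


computed = put_cycle_in(COLUMN_3, CYCLE)
largest_gap = max(abs(a - b) for a, b in zip(computed, COLUMN_5))
print(largest_gap)
```

The computed figures differ from the published column (5) only by rounding in the second decimal place.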



Fig. 136.—Seasonal variation in the hypothetical time series. See
Table 80.
Fig. 137.—The cycle in the hypothetical time series. See Table 79,
column (4).

Fig. 138.—All three component elements of the hypothetical time series
combined. See Table 79, column (5).

Two important effects of combining the growth element and
the seasonal-variation element are noticeable from Fig. 135.
In the first place, the combination has a tendency to obscure the
trend. It is still clear in line BB' that there is a rising tendency,
but the wide sweeps of the seasonal fluctuations tend to conceal
the exact nature of the rise; for without the line AA' in Fig. 135 it
would be difficult to visualize precisely what the slope of this
trend actually is. In the second place, the combination definitely
distorts the shape of the seasonal variation, in two ways: (1) It
causes the valleys and peaks to be thrown out of line arithmetically.
(2) It minimizes the size of the seasonal variation
where trend is low and exaggerates the size of the seasonal
variation where trend is high.
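
Both distortions can be seen in the hypothetical figures themselves. The sketch below rebuilds the trend-with-seasonal series from the stated trend (1.0 rising by 0.2 per month) and the Table 80 index, then reports for each year the month of the peak, the month of the trough, and the distance between them; the helper names are assumptions made for this illustration.

```python
SEASONAL = [145, 124, 101, 83, 66, 52, 43, 42, 93, 143, 158, 157]  # Table 80
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def trend_with_seasonal(year_index):
    """Twelve monthly values of trend times seasonal index for one year."""
    start = 12 * year_index
    return [(1.0 + 0.2 * (start + m)) * SEASONAL[m] / 100 for m in range(12)]


def summarize(year_index):
    """Return (peak month, trough month, peak-to-trough swing) for a year."""
    series = trend_with_seasonal(year_index)
    peak = series.index(max(series))
    trough = series.index(min(series))
    return MONTHS[peak], MONTHS[trough], max(series) - min(series)


# Where the seasonal index itself is highest and lowest.
index_peak = MONTHS[SEASONAL.index(max(SEASONAL))]
index_trough = MONTHS[SEASONAL.index(min(SEASONAL))]

summaries = [summarize(y) for y in range(3)]
for year, (peak, trough, swing) in zip((1943, 1944, 1945), summaries):
    print(year, peak, trough, round(swing, 2))
```

The index is highest in November and lowest in August, yet in the combined series the rising trend moves the peak to December and the trough to July in each of the three years, and the swing between trough and peak widens from year to year as the trend rises.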

From Fig. 138 it is clear that the effect of including the 
cyclical movement is further to obscure the trend or growth 
element and to distort still more the character of the seasonal 
variation. It is in approximately this condition that most time 
series exist in their raw state. Raw data of time series contain 
in varying degrees elements of all three of these types of fluctuation.
Some have little seasonal variation, some have a great
deal, and some have none. Many have rising trends following 
population and general growth, while a few have declining 
trends because they represent decaying or disappearing types 
of economic or social activities. In practically all time series, 
cycles of varying length and varying amplitude occur. 

In addition to the three elements illustrated by the hypo¬ 
thetical case, most time series contain fluctuations due to unusual 
or residual occurrences, such as the effects of floods, storms, or 
strikes. This gives four elements or types of fluctuation, and
these four types of fluctuation serve as a good classification for an
empirical start in the analysis of time series.¹

GENESIS AND PURPOSES OF THE TIME-SERIES ANALYSIS 

The hypothetical problem just illustrated consisted in a 
synthesis. The study of time series is analysis—a reversal of the 
procedure that has just been demonstrated. This breaking up 
of time series into its constituent elements, and the various com¬ 
plications involved, constitutes the subject of time-series analysis. 
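
For the hypothetical series the reversal is simple arithmetic, because the seasonal and cyclical indexes are known in advance. The sketch below divides the 1943 figures of column (5) of Table 79 by the cycle and by the seasonal index and so recovers the trend. The January cycle value of 100 is an assumption inferred from the table, and the function name is hypothetical. With real data the components are not known beforehand and must themselves be estimated, which is where the complications of the analysis arise.

```python
SEASONAL = [145, 124, 101, 83, 66, 52, 43, 42, 93, 143, 158, 157]  # Table 80
CYCLE_1943 = [100, 101, 102, 103, 106, 109,
              112, 115, 120, 122, 124, 125]       # Table 79, column (4)
COLUMN_5_1943 = [1.45, 1.50, 1.44, 1.37, 1.26, 1.13,
                 1.06, 1.16, 2.90, 4.88, 5.88, 6.28]  # Table 79, column (5)


def remove_components(series, seasonal_percent, cycle_percent):
    """Divide out the cycle and the seasonal index, leaving the trend."""
    return [v / (c / 100) / (s / 100)
            for v, s, c in zip(series, seasonal_percent, cycle_percent)]


recovered = remove_components(COLUMN_5_1943, SEASONAL, CYCLE_1943)
print([round(v, 2) for v in recovered])
```

The recovered figures follow the original trend of 1.0, 1.2, 1.4, and so on up to 3.2, apart from small discrepancies caused by the rounding of the published table.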

Why do economists, social scientists, and statisticians analyze 
time series? What started them along this line of procedure, 
and what are its advantages? The answers to these or other 
questions as to the significance of time-series analysis have in 
general a threefold basis: (1) interest in the population problem 
and the discovery of the law of organic growth, (2) concern for 
the general problem of the so-called “business cycle,” and (3)
preoccupation with the variety of problems associated with 
seasonal influences upon business and social life. 

Rational Trends. Historical Background. In 1798, Thomas
Robert Malthus, a minister of the gospel and a political econo¬ 
mist, wrote an Essay on the Principle of Population, in which he 
advanced the fundamental principle that the law of growth of 
population is geometric—population, he said, tends to grow in a 
geometric progression. The curve representing population 

1 This is the conventional classification of types of fluctuation that occur
in time series; it was presented in detail by W. M. Persons of the Harvard
Committee of Economic Research and published in the Review of Economic
Statistics, Preliminary Vol. 1. See also articles by the same author in the
American Economic Review, Vol. 6 (1916), pp. 739-769, and Publications
of the American Statistical Association, Vol. 12 (1917), pp. 602-642.
growth would accordingly be a positive exponential curve,
similar in character to the curve representing the growth of a 
principal sum of money at compound interest. 
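
The comparison with compound interest can be written out explicitly. The symbols below ($P_0$ for the initial population or principal, $r$ for the rate of growth per period, and $t$ for the number of periods elapsed) are notation assumed for this illustration and are not taken from the text.

```latex
\[
P_t = P_0 (1 + r)^t ,
\qquad
\frac{P_{t+1}}{P_t} = 1 + r
\quad \text{(a constant ratio, that is, a geometric progression)},
\]
\[
P_t = P_0 e^{g t} \quad \text{with } g = \ln(1 + r),
\qquad
\log P_t = \log P_0 + t \log(1 + r).
\]
```

The last form shows that the logarithm of a geometrically growing quantity is a straight line in $t$.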

While some of the doctrines of Malthus regarding the controls 
to population growth are no longer accepted as tenable, the 
fundamental principle of the tendency of population to grow 
geometrically has not only been accepted with regard to popula¬ 
tion theory but has been widely applied in other fields. To 
people of the twentieth century this principle seems almost 
axiomatic, for they are familiar with the history of the nineteenth 
century, when the statistics show such a growth of population 
and such a development of many kinds of activities according to 
this principle of geometric progression. 

The principle of growth was not so obvious to those living at 
the time of Malthus, nor to those living in the middle of the 
nineteenth century. Consequently, it was startling and new to 
see the same principle applied to growth in certain economic and 
social phenomena, as was done by William Stanley Jevons, an 
English economist, in his celebrated book on The Coal Question 
(1865). Chapter IX of that book is entitled Of the Natural Law 
of Social Growth. In this he propounds the idea that many of 
the phenomena of economic and social life follow the same law 
of organic growth as population. In some, the progressive rate 
of geometric growth is greater than that of population; in some, 
less; but in all the growth is geometric. In another chapter of 
the same book, Jevons applied this principle and tested it with 
reference to England’s progress in industry. His contribution 
was of the nature of the proposal of a hypothesis that served as a 
challenge to mathematically minded economists like himself 
and others and soon stimulated the development of ideas as to 
how best to write the equation for the curve that would repre¬ 
sent growth of population. By such an equation, it was thought, 
population could be forecast far into the future as well as for 
intercensus years. 

Population Curves. In 1891, A. S. Pritchett suggested that 
an equation of the form $P = a + bt + ct^2 + dt^3$ would fit the
curve of population growth. The subject of equations for the 
population curve became one of wide concern to population 
students, economists, and scientists in general, as well as of prac¬ 
tical interest in obtaining accurate estimates of population 
between the dates of taking the census. In 1907, Raymond
Pearl proposed that the form of this equation should be

$$P = a + bt + ct^2 + d \log t$$ *

The problem was again approached by G. Udny Yule, an English
statistician, in 1925;¹ and in later years the discussion was
continued.²

Perhaps the most striking contribution on the subject is that 
of Raymond Pearl and Lowell J. Reed, who in 1920 advanced the 
idea that the population curve should not continue to rise 
indefinitely but should level off after some period of time and 
that thus the population curve showing the law of growth would 
not follow the compound-interest curve indefinitely. Rather, it 
would resemble the curve shown in Fig. 145 in Chap. XXI. 

The mathematical characteristics of this curve and its equa¬ 
tions are presented by these joint authors in the Journal of the 
Royal Statistical Society for 1927.† As will be clear from a glance
at Fig. 145, the shape of this curve indicates about the growth of 
population or the law of organic growth that the first period 
of relatively slow arithmetical growth is followed by a period 
of very rapid arithmetical growth but that finally a period of 
slowing down of this rapid arithmetical growth occurs so that the 
curve at the top assumes an asymptotic character. 
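
The three phases can be illustrated numerically with a standard form of the logistic function. The parameter names and values below (a limit of 100, a rate of 0.5, and a midpoint at t = 10) are arbitrary assumptions chosen for illustration; they are not the constants fitted by Pearl and Reed, and the parameterization is not necessarily the one used in their paper.

```python
import math


def logistic(t, limit=100.0, rate=0.5, midpoint=10.0):
    """Standard logistic curve that approaches `limit` as t grows."""
    return limit / (1.0 + math.exp(-rate * (t - midpoint)))


values = [logistic(t) for t in range(21)]
# Arithmetical growth from one period to the next.
increments = [b - a for a, b in zip(values, values[1:])]

print(round(increments[0], 2),    # early period: slow growth
      round(increments[9], 2),    # middle period: rapid growth
      round(increments[19], 2))   # late period: growth slowing again
print(round(values[-1], 2))       # close to, but below, the limit
```

The increments are small at first, largest near the midpoint, and small again at the end, while the level of the curve approaches the limit without reaching it.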

Early Population Theories. Quételet remarked that Malthus's
doctrine resolved itself essentially to the proposition that, under 
the most favorable industrial circumstances, population could 
grow no more rapidly than in an arithmetical progression, 
although, of course, he stated the geometric law of growth as a 

* Knibbs, George H., “The Laws of Growth of Population,” Journal of
the American Statistical Association, Vol. 21 (1926), p. 381. 

1 Journal of the Royal Statistical Society, Vol. 88 (1925), pp. 1-62, which
contains an excellent historical summary of the problem of curve fitting to 
population growth. 

² Reed, L. J., and Raymond Pearl, “On the Summation of the Logistic
Curve,” Journal of the Royal Statistical Society, Vol. 90 (1927), pp. 729-746.
The mathematics of the curve was discovered, say the authors, by Verhulst,
according to Quételet writing in 1838, and was again applied to population
by Pearl and Reed in 1920. Cf. Pearl, Raymond, Studies in Human
Biology (1924), Chap. XXIV, The Curve of Population Growth.

† Op. cit.
tendency. It could grow only in arithmetical progression
because it would be kept down to that rate by the fact that sub¬ 
sistence grows only in arithmetical progression. He also pointed 
out that the theory of population growth up to the time he was 
writing (1836) had not been developed to the point where it 
could be considered “dans le domaine des sciences mathématiques,
auquel elle semble spécialement devoir appartenir.”¹
Even so, Quételet himself never went to the point of developing a
mathematical equation expressing the law of population growth, 
although in other ways his contributions as a population theorist 
are outstanding. However, he did reach the point of suggesting 
that the law of population growth is like that of a body traveling 
through a resisting medium that tends to attain a limiting 
velocity. 

Yule suggested that this analogy probably inspired Verhulst, 
professor of mathematics at the École Militaire, to a controversy
with Quételet on the subject. The problem of devising a
mathematical law of population growth was actively studied by 
Verhulst for a number of years. He fitted logistic curves to the 
population histories of several countries for as many years as 
data were available, but the limited amount of data did not 
inspire confidence in the results.² This work of Verhulst seems
to have been forgotten until the time of the Pearl-Reed studies 
of 1920. Pearl and Reed’s discovery of the law of population 
growth in the mathematical form developed by them was 
independent. As Yule says, they seemed to have been unaware 
of the formulation by Verhulst. 

Basis for Rationalizing Trends. The attempt is made by 
students of the law of population growth to rationalize the 
fitting of such a logistic curve to experienced growth of popula¬ 
tion in many parts of the world, and at different times, by basing 
their reasoning upon the following points: 

1 Sur l'homme (Bruxelles, Louis Hauman et Comp., 1836), pp. 283, 287.

² “Notice sur la loi que la population suit dans son accroissement,”
Correspondance mathématique et physique publiée par A. Quételet (1838),
tome 10 (also numbered tome 2 of the third series), pp. 113-121; and by
the same author, “Recherches mathématiques sur la loi d'accroissement de
la population,” Nouveaux mémoires de l'Académie Royale des Sciences et
Belles-Lettres de Bruxelles, tome 17 (1846), pp. 1-38; “Deuxième mémoire
sur la loi d'accroissement de la population,” ibid., tome 20 (1847),
pp. 1-32. Citations from G. Udny Yule, op. cit., p. 57.
1. The construction of such a curve through the plotted points
showing actual population growth in a large number of places
produces a good fit.

2. Biological experiment under controlled conditions, with
other species than man, produces increases in numbers in a 
manner following such a curve. Thus Pearl made such an 
experiment with fruit flies under controlled conditions.¹

3. Studies of trends in birth rates and death rates, in their 
relation to population growth, appear to fit into the theory that 
the law of population growth follows this curve. 

4. Studies of death rates by age distribution of the population 
and the relationship between age composition and total death 
rate and birth rate of a population appear to fit into the law of 
population thus formulated.²

5. While it is true that the parabolas of earlier writers fit 
empirically the population growth wherever tried, such a curve 
fit cannot be rationalized, because the extension of the parabola 
goes on to infinity. On the other hand, the logistic curves of the 
Verhulst, Pearl-Reed, or Gompertz variety approach a limit in an 
asymptotic manner, which seems to be a more rational manner in 
which to view the law of population growth. 

6. The asymptotic limit that it is assumed population is
approaching can be closely approximated by study of the circum¬ 
stances surrounding the determination of the factors influencing 
population growth. 
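
The contrast drawn in point 5 can be stated as limits. The forms below are standard textbook parameterizations of the parabola, the logistic, and the Gompertz curve, given here as an illustration; the symbols $L$, $\alpha$, $\beta$, $g$, and $c$ are assumed notation and are not taken from the authors cited.

```latex
\[
\lim_{t \to \infty} \left( a + bt + ct^2 \right) =
\begin{cases}
+\infty, & c > 0, \\
-\infty, & c < 0,
\end{cases}
\]
\[
\lim_{t \to \infty} \frac{L}{1 + e^{\alpha - \beta t}} = L
\quad (\beta > 0),
\qquad
\lim_{t \to \infty} L \, g^{\,c^{t}} = L
\quad (0 < g < 1,\; 0 < c < 1).
\]
```

The parabola has no finite limit, whereas both of the other curves approach the ceiling $L$ asymptotically.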

Thus, it is recognized in this theory of the law of population 
growth that should technological changes comparable with the 
industrial revolution occur, the asymptotic limit might have to 

¹ Cf. Pearl, R., The Biology of Death, pp. 253-254. Cited in Yule,
op. cit., p. 22.

² These ideas have reached the general public as well as the scientific
group, through such articles as Robert A. Kuczynski, “The World's Future
Population,” The New Republic, May 7, 1930; Aaron Hardy Ulm, “Our
Falling Birth Rate Is Studied by Experts,” The New York Times, Mar. 2,
1930; Louis I. Dublin (Statistician of the Metropolitan Life Insurance
Company), “America Approaching Stabilized Population,” The New York
Times, Mar. 4, 1930; and by the same author, “Our Aging Population: Its
Vital Effects,” The New York Times, Jan. 4, 1931. Cf. also Dublin,
Louis I., and Alfred J. Lotka, “On the True Rate of Natural Increase,”
Journal of the American Statistical Association, Vol. 20 (1925), pp. 305-339;
and Dublin, Louis I., “The Statistician and the Population Problem,”
ibid., Vol. 20 (1925), pp. 1-12.
be raised and that the law of population growth over a period of
centuries may be conceivably a series of ogive-like cycles. 

Criticism of Rationalized Trends. However, this rationalistic 
view of curve fitting to population and the attainment in this 
manner of a mathematical law of population growth have not 
gone unchallenged. Prof. A. L. Bowley, an outstanding English 
statistician, says,¹ “I regret that so much prominence has been
given to the logistic equation. It certainly has the merit, and 
the danger, of mathematical neatness, and it expresses what may 
be regarded as a fundamental law of population—that is, that 
population cannot increase indefinitely in constant geometric 
progression. There is, however, no reason a priori to suppose 
that the damping down of the increase is of so regular or uniform 
a nature that a mathematical function of the same form
represents it in all times and in all places, and none a priori to justify
the use of a linear term (out of all possible functions) for this 
purpose. We should rather anticipate that the form of the 
function would be neither general nor linear. The justification 
for the logistic form is purely empirical, and, in fact, we are asked 
to accept it because it does give results which agree with the 
records of certain populations. Any other curve which gives as 
good an agreement has similar claims for representing the series 
of records. The closeness of the agreement is, I think, unduly 
accented by the very small vertical scales used by Dr. Pearl and 
Mr. Yule. . . .”

T. H. C. Stevenson, another English statistician, rather 
prosaically declares that he finds sufficient explanation, without 
resort to logistic curves, for the rapid decline in birth rates since 
the end of the nineteenth century, in the dissemination of
knowledge of contraception.²

More recently, the whole question of the rationality of curve 
fitting was taken up in an admirably thorough manner by 
George H. Knibbs, whose findings are apparently that the 
mechanical process of the curve fitting is empirical and must be 
accepted as empirical but that the law of population growth 
may be conveniently expressed by such equations when it is 

¹ From remarks on Yule's paper, op. cit., p. 76.

² “The Laws Governing Population,” Journal of the Royal Statistical
Society, Vol. 88 (1925), pp. 63-76.
thoroughly understood how those equations apply, and also
their limitations.¹

It is natural to scientists to be skeptical, particularly of other 
scientists' startling discoveries, and the student of social science 
must get used to such controversies and pick and choose for 
himself what he believes to represent progressive development 
of human knowledge and what merely overzealous creative 
imagination. It is in these attempts to explain phenomena that 
the progressive development of human knowledge occurs. 

Application of Rational Trends to Social Philosophy. It was
pointed out above that Jevons had advanced the hypothesis 
that the law of organic growth applies also to social and economic 
phenomena. Following the example of the population
curve-fitting group, scientific curiosity turned to the discovery of a
rational conception of curve fitting to social, biological, and 
economic phenomena in order to replace purely empirical 
methods. As Wesley C. Mitchell has pointed out,² “A step
toward such a conception is represented by the frequent
interpretation of certain trend lines as showing the ‘growth factor.’
Statisticians dwell with satisfaction upon their demonstrations 
that certain industries have expanded decade after decade at a 
substantially uniform rate, or at a rate which has changed in 
some uniform way. They take almost as much pleasure in
contemplating the somewhat similar rates at which different
industries have grown in given periods and countries. Nor are they
at a loss for explanation of these uniformities. In view of the 
increase in population characteristic of the great commercial 
nations and of the advance in industrial technique, it seems 
scarcely fanciful to think of modern society as ‘tending’ to
produce an ever larger supply of goods for the satisfaction of its 
wants. On this basis, cyclical fluctuations appear as alternating 
accelerations and retardations in the pace of a more fundamental 

¹ “Laws of Growth of Population,” Journal of the American Statistical
Association, Vol. 21 (1926), pp. 381-398; and Vol. 22 (1927), pp. 49-59.

2 Business Cycles—The Problem Stated and Its Setting, (1928), pp. 221-224; cf, 
Prescott, Raymond B., “Law of Growth in Forecasting Demand,” (1928), 
Journal of the American Statistical Association, Vol. 17 (1922), pp. 471-479. 
Later, Leroy E. Peabody fitted such a curve to railway traffic in the United 
States, “Growth Curves and Railway Traffic,” Journal of the American
Statistical Association, Vol. 19 (1924), pp. 476-483.
process. Secular trends, in short, are taken to measure economic
progress generation by generation. 

“A bold speculation of this sort has been ventured by Raymond 
B. Prescott. He suggests that perhaps ‘all industries, whose
growth depends directly or indirectly upon the ability of the 
people to consume their products,’ pass through similar phases 
in the course of their development. Four stages seem to be 
common. 

1. Period of experimentation.

2. Period of growth into the social fabric.

3. Through the point where the growth increases, but at a
diminishing rate.

4. Period of stability. 

“On this basis, Prescott suggests that the secular trends of all 
such industries may be represented by a single type of curve— 
that yielded by the Gompertz equation. Every country may 
have a different rate of growth, and so may every industry, 
because no two industries have the same combination of
influences. They will trace the same type of curve, however,
even though the rate of growth is different.” 
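
The four stages that Prescott describes can be seen in a small numerical sketch of a Gompertz curve. The form used here, y = k · a^(b^t) with a and b between 0 and 1, is a common way of writing that curve; the parameter values are hypothetical and are not taken from Prescott's paper.

```python
import numpy as np

def gompertz(t, k, a, b):
    """Gompertz curve y = k * a**(b**t); with 0 < a < 1 and 0 < b < 1 it rises toward k."""
    return k * a ** (b ** t)

t = np.arange(0, 41)
y = gompertz(t, k=100.0, a=0.01, b=0.85)   # hypothetical parameters

growth = np.diff(y)                 # increase from each period to the next
peak = int(np.argmax(growth))       # period after which growth begins to diminish

print("first values  :", np.round(y[:4], 2))     # slow start (experimentation)
print("largest growth: period", peak)             # growth into the social fabric
print("last values   :", np.round(y[-4:], 2))     # approach to stability near k
```

The period-to-period increases first rise, reach a maximum, and then diminish as the curve approaches its upper limit k, which corresponds to the stage of stability.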

More recently, an ambitious and carefully studied attempt to 
rationalize the whole subject of trends in economic phenomena 
was made by Simon S. Kuznets, of the National Bureau of 
Economic Research.¹ Kuznets analyzes the various factors
making for growth, and also making for slowing up of growth,
under the following items: 

1. On the side of growth: 

Population increase. 

Changes in demand. 

Technological changes. 

2. On the slowing up of growth: 

Slackening of technological progress. 

Retarding influence of other slower industries. 

Funds available for expansion decrease in relative size as industry 
grows. 

Competition of later developing industries in other countries. 

¹ Secular Movements in Production and Prices (1930); in 1943, Kuznets’s
work is still the best statistical study of this type. For more recent trend
studies of a different type, see Edwin Frickey, Economic Fluctuations in the
United States, 1866-1914 (1942), and Norman J. Silberling, The Dynamics
of Business (1942). These studies use subjective methods for analyzing 
trends and cycles. 
Kuznets fits logistic curves to a large number of production
series and also fits appropriate curves to the corresponding price 
series. It should be noted that this type of rationalization does 
not apply to price series, and as a rule the curves that Kuznets 
fits to his price series were merely parabolas and represented 
empirical trends. One of the most interesting results of his 
work is his discovery and analysis of “secondary trends.”



Fig. 139.—Production of Portland cement in the United States with logistic 
trend line, 1880-1924. 

Thus, from a large variety of data, he took out the long-time 
growth, upon the assumption of the existence of a logistic growth 
element, and he found, not only cycles, but also longer wavelike 
movements of 11 to 20 years. This is illustrated by Figs. 139 to 
141, reproduced from his book and showing the type of analysis 
as applied to cement production and prices, 1880-1924.¹ As
seen in Fig. 139, the heavy line represents the logistic curve, and 
there are long sweeps of the actual data in waves above and below 

¹ Op. cit., pp. 100-101, reproduced by permission of the author.
this growth curve, as well as cyclical movements of shorter
duration. Figure 140 shows a parabola fitted to the course of
prices of cement during the same period. Figure 141 shows the
long wavelike movements in production and in prices, with the
relative fluctuations of the actual data above and below these
secondary trends. Kuznets calls the logistic growth curve the
“primary trend line,” and the heavy, black, wavelike line in
Fig. 141 represents the secondary trends of the production of
cement. The actual data fluctuate above and below these
secondary trends in major and minor cycles.

Fig. 140.—Factory prices of Portland cement in the United States, original data
and primary trend, 1880-1924.

Fig. 141.—Production of and prices of Portland cement in the United States,
1880-1924. Secondary trends and minor cycles.
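
The separation of primary trend, secondary trend, and shorter cycles may be sketched as follows. This is not Kuznets's own procedure, which the text does not describe in detail; it is a minimal illustration that assumes the secondary trend can be approximated by a centered moving average of the relative deviations from the primary trend. The series itself is artificial: a logistic primary trend, an 18-year wave, and a 4-year cycle, all with hypothetical parameters.

```python
import numpy as np

def moving_average(x, window):
    """Centered moving average; `window` is assumed odd. The ends are left as NaN."""
    half = window // 2
    out = np.full(len(x), np.nan)
    for i in range(half, len(x) - half):
        out[i] = x[i - half:i + half + 1].mean()
    return out

years = np.arange(1880, 1925)
t = years - years[0]

# Artificial series built from assumed components (not the cement data).
primary = 100.0 / (1.0 + np.exp(3.0 - 0.15 * t))     # logistic primary trend
long_wave = 0.10 * np.sin(2 * np.pi * t / 18.0)      # 18-year wavelike movement
short_cycle = 0.04 * np.sin(2 * np.pi * t / 4.0)     # 4-year cycle
actual = primary * (1.0 + long_wave + short_cycle)

relative_dev = actual / primary - 1.0         # deviations from the primary trend
secondary = moving_average(relative_dev, 9)   # smoothed: the longer movement
minor = relative_dev - secondary              # remainder: the shorter cycles
```

The 9-year window is an arbitrary choice for the sketch. A window of this length largely averages out the 4-year cycle while retaining most of the 18-year wave.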

Before the publication of Kuznets’s work these longer
movements had been studied by C. A. R. Wardell.¹ Wardell called
the movements “major cycles” rather than secondary trends,
and his method of analysis was quite different from that followed
by Kuznets. He also attempted to give an explanation of the
major business cycle that Kuznets rejects.² In 1927, also, there
appeared in Russian a discussion of the whole problem of major
cycles, which contains a report by Kondratieff and a
counterreply by D. T. Oparin. To explain these major cycles
Kondratieff developed the theory that they are essentially cycles
of expansion and contraction in the growth of the basic capital 
equipment of a country. 

Thus, starting with the desire to define the law of population 
growth more precisely and to bring the population problem into 
the realm of mathematical treatment, scholars have carried on 
by analogy into other fields; so far as economics is concerned, 
the principal result so far is the discovery of these long wavelike 
movements. Not only do the theoretical economists need to 
explain the old-fashioned business cycle (which was always a 
rather vague concept), but they now are challenged to explain 
(1) secondary secular movements or major business cycles, (2) 
ordinary business cycles, and (3) minor business cycles. The 
analysis of time series, then, must include some additional types 
of fluctuations from those described in a preliminary manner at 
the beginning of the chapter. 

The following classification of movements is now suggested.³

1. Trend, or long-time growth, which appears to be logistic in
character and for which a mathematical formula may be rational.

2. Cyclical movements of three types, for which a rational
mathematical formula is not appropriate.

a. Secondary secular movements or major cycles.

b. Cycles (the old theoretical business cycle).

c. Minor cycles.

¹ An Investigation of Economic Data for Major Cycles, Thesis (University
of Pennsylvania, 1927).

² Op. cit., pp. 265-266.

³ Cf. classification suggested by Prof. Willford I. King, which is similar,
in “Principles Underlying the Isolation of Cycles and Trends,” Journal of
the American Statistical Association, Vol. 19 (1924), p. 468.
3. Seasonal variations, for which a mathematical formula is not
rational. 

4. Irregular fluctuations, such as those due to wars, epidemics, 
floods, or strikes. These are called “residual fluctuations” and
may follow the normal curve.¹

Empirical Trends. Trend analysis, that is to say, the
application of mathematical processes in order to obtain equations
describing direction of movement of a time series, may be 
applied, not only for rational ends indicated in the discussion of 
the law of organic growth, but also empirically where no a priori 
knowledge about the character of growth or law of movement or 
trend exists. Indeed, the search for such a law may have no 
bearing on the analysis; the trend may be sought for the purpose 
of isolating and studying cyclical movements. When trends are
found without seeking to verify some hypothesis concerning a 
law of growth but merely with respect to given data, they are 
empirical trends. 

Application of Empirical Trends to Cycle Analysis. The
third factor mentioned at the beginning of this chapter as a 
force stimulating statistical analysis of time series has been the 
abstract study of the business cycle. Such abstract analysis 
has challenged the mathematical economist and the statistician 
to discover and to apply methods of statistical analysis that 
would measure the cycle. 

Economic history of the modern era has been one of
alternating periods of relative prosperity and relative depression
and has also been characterized by periods of more or less violent 
speculative activity. The Mississippi Scheme and the South 
Sea Bubble burst in France and England in 1720, and there 
occurred commercial crises of major importance in 1763, 1772, 
1783, and 1793. During the eighteenth century these recurring 
periods of crisis excited much discussion, but eighteenth-century 
writings dealt mainly with the dramatic surface events and did
not develop a theory explaining them. And indeed the
fundamental principles of economics were not formulated until the
latter half of the eighteenth century. The publication of Adam 
Smith’s Wealth of Nations in 1776 is usually taken as the debut 
of economics as a science. 


¹ See pp. 285-297, 570, and 648.
While a group of economists following Adam Smith developed
a theoretical explanation of the operation of economic forces
under normal conditions, or in the long run, another group that
assumed the role of “critics of the economists” developed
theories of the business cycle. These were such men as Sismondi,
Rodbertus, and Karl Marx. J. C. L. Simonde de Sismondi, an
Italian Swiss, had originally been a thorough convert of Adam
Smith and laissez faire and had become the Continental expositor
of his theories; but as he said, writing in 1818 and referring to
the depression of 1815-1817, he was deeply affected by the
commercial crisis that Europe had experienced and by the cruel
sufferings of the industrial workers that he had witnessed in
Italy, Switzerland, and France and that all reports showed to
have been at least as severe in England, in Germany, and in
Belgium.¹ He set about developing a theory to explain the
recurrence of such periods, and in his work are found many of
the ideas current even today concerning the origin and
explanation of the business cycle. He suggested that the business cycle
is due to the faulty organization of the capitalist system and that
the system is planless and therefore needs planning. He also
suggested the explanation that what is needed is a better
distribution of income. He suggested the oversaving hypothesis.
His principal explanation is the inequality in the distribution of 
incomes resulting in glutting of the markets and the production 
of crises and depressions. 

The idea that commercial crises are cyclical in character 
evolved early in the nineteenth century; some even went so far 
as to advance the theory that they occur every 7 or every 11 years. 
In 1875, this led the economist and statistician, W. S. Jevons, to 
propound a theory that the business cycle is due to cycles that 
occur in sun spots, which it had been discovered have a rhythm 
of about 11 years.² During the latter half of the nineteenth
century a number of statistical attempts to discover the business
cycle were made. The attempts used the idea of smoothing 
out the irregular fluctuations in the curves of raw data and 

¹ Mitchell, op. cit., pp. 4-5. The historical material here given on the
business cycle is taken principally from this source.

² For a more complete discussion of the history of business-cycle theory
than it is possible to give here, see ibid. and also Ernst Wagemann, Economic
Rhythm (1930), either of which contains further bibliographical references.
thereby clarifying the cyclical movements. The earliest
examples of such statistical work appear to be in 1884.¹

Both Jevons and the later experimenters of the nineteenth 
century were content with attempts to discover cyclical
movements in separate individual series. In 1909, Beveridge in
England; in 1911, Julin in France; and in 1913, Mortara in
Italy conceived the idea of combining a number of series into a
composite statistical measure of the business cycle. The work 
of carrying out this task was then largely taken over by the 
American statisticians, in the construction of the so-called 
“barometers” of business conditions that have been described 
in Chap. XIX, Index Numbers. The period up to about 1914
may be characterized as one during which interest in the subject 
of the business cycle was intense. Economists were in sharp 
controversy with the business-cycle theorists, denying
emphatically the implications that they drew from their analysis of the
statistics available and from their theoretical explanations of the 
business cycle. At the same time, the disturbing theories of 
the business-cycle students had greater claim to general interest 
because they touched upon a more vital and present thing than 
was customarily dealt with by the conventional economist.
The conventional economist was explaining how things tend to 
happen under normal conditions, and the business-cycle theorists 
loudly proclaimed that we never live under “normal conditions” 
and that the theories of the economist were therefore useless. 
At the same time, the interest of the practical businessman was 
aroused by the desire for knowledge of the relationship between 
his own particular business and the general business cycle. 

Development of Technique for Time-series Analysis. The 
pressure to develop a statistical technique to analyze the
problem was thus very great, and the accumulation of available
statistical material to analyze had been rapid for a number of
years. The technique that developed assumed two general 
characteristics, one of which has since been extensively used, 
the other less frequently. 

The first method of technique that developed was the discovery 
statistically of the cycle in time series by the removal of the 

¹ Poynting, J. H., and R. H. Hooker, “Comparison of the Fluctuations
in the Price of Wheat and in the Cotton and Silk Imports into Great
Britain,” Journal of the Royal Statistical Society, Vol. 47 (1884), pp. 34-64.




trend from a series of annual data. Trends were fitted
empirically to the data by the method of least squares or some other
method (most commonly by the method of least squares), using
relatively short periods of time, say 9 to 19 years. The cyclical
movements then were the measures of the movements of the
data above and below the empirical trend. Prof. Willford I.
King said, “Any particular type of fluctuation in which we
happen to be interested can be successfully studied only when
most of the other kinds of fluctuations have been eliminated.”¹

This is, of course, the raison d'être for the empirical trend
analysis, which is primarily for the purpose of isolating the
ordinary and the minor cycles. The major cycles or secondary
secular movements are best studied by the Kuznets methods that 
have been described and illustrated. The methods of analysis 
used are essentially similar to those employed in empirical trend 
analysis, but the Kuznets logistic trend lines may be rationalized 
in terms of a law of organic growth. 

The second method of technique that developed was the 
attempt to apply harmonic analysis or the periodogram to series 
of economic data, a different application of the method of least 
squares. This was the work of Henry L. Moore of Columbia 
University in his application of Fourier’s theorem, the
mathematics of which Fourier had developed a century ago in his
Théorie des mouvements de la chaleur dans les corps solides and
for which he was feted by the Académie des Sciences in 1812.

Prof. Moore applied the mathematics of the periodogram to the 
records of rainfall in the corn belt of the United States, working 
out the periodogram equations for the cycles of rainfall; he 
discovered similar cycles in crops and introduced the harmonic 
analysis into modern statistical method. He says:² “The
principal contribution of this essay is the discovery of the law and
cause of economic cycles. The rhythm in the activity of
economic life, the alternation of buoyant, purposeful expansion
with aimless depression, is caused by the rhythm in the yield 
per acre of the crops; while the rhythm in the production of the 

¹ Journal of the American Statistical Association, Vol. 19 (1924), p. 468.

² Economic Cycles: Their Law and Cause (1914). Cf. Crum, W. L.,
“Periodogram Analysis,” Chap. XI in H. L. Rietz, Handbook of
Mathematical Statistics (1924). Also Brunt, David, The Combination of
Observations (1931), Chaps. XI and XII.
crops is, in turn, caused by the rhythm of changing weather,
which is represented by the cyclical changes in the amount of
rainfall. The law of the cycles of rainfall is the law of the cycles
of the crops and the law of economic cycles.”

The mathematics of the harmonic analysis are somewhat
complex, and this method has not attained the popularity that
has been attached to the removal of empirical trend by using 
straight lines or second- or third-degree polynomials, where the 
mathematical analysis involved is quite simple. 
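
The idea behind the periodogram can be sketched with modern tools. The example below is not Moore's computation; it assumes a simple least-squares version in which, for each trial period, a sine and cosine pair is fitted to the series and the resulting amplitude is recorded. The "rainfall" series is artificial, with an 8-year rhythm and random disturbances, and the function name `periodogram` is introduced here only for illustration.

```python
import numpy as np

def periodogram(y, periods):
    """Least-squares amplitude of a sine wave of each trial period fitted to y."""
    t = np.arange(len(y))
    centered = y - y.mean()
    amplitudes = []
    for p in periods:
        design = np.column_stack([np.cos(2 * np.pi * t / p),
                                  np.sin(2 * np.pi * t / p)])
        (a, b), *_ = np.linalg.lstsq(design, centered, rcond=None)
        amplitudes.append(np.hypot(a, b))     # amplitude of the fitted wave
    return np.array(amplitudes)

# Artificial annual series: an 8-year rhythm plus random disturbances.
rng = np.random.default_rng(0)
t = np.arange(96)
rainfall = 30.0 + 4.0 * np.sin(2 * np.pi * t / 8.0) + rng.normal(0.0, 1.0, size=t.size)

trial_periods = np.arange(3, 21)
amp = periodogram(rainfall, trial_periods)
best = trial_periods[np.argmax(amp)]
print("trial period with largest amplitude:", best)
```

A pronounced peak in the amplitudes at one trial period is what suggests a cycle of that length; in real data the peaks are seldom so clear, which is one reason the method calls for caution.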

Use of Functions of Arc Tangent and Orthogonal Polynomials 
in Trend Analysis. In recent years two modified forms of the 
conventional trend analysis by the method of least squares have 
been developed. In 1928, it was suggested that the inverse 
trigonometric function, or arc tangent, could be adapted to 
measuring trends in series behaving in the following manner:¹

1. A downward tendency approximating a straight line but
of such nature that projection of a straight line into the future 
would lead to absurd results, that is, negative or ridiculously 
small positive values when comparatively large positive values 
only are possible. 

2. Approximately a linear growth or decline, followed by an 
abrupt change in level (rise or drop) and subsequent resumption 
of the early tendency. 

3. Approximately a straight-line trend interrupted by a sharp 
rise or drop, followed by another abrupt change in level and 
subsequent resumption of the early movement at the same or a 
different level. 

The method was used successfully in fitting a trend to the 
annual prices of International Paper common stock for the period 
1900-1926 and to the annual index of wholesale prices in the 
United States, 1900-1928. 
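
The first of the three cases can be illustrated by a rough sketch. The text does not give the equation actually used, so the form y = a + b · arctan((t − t0)/s) is an assumption made here, as are the made-up declining series and the helper name `fit_arctan_trend`. The constants a and b are found by least squares for each trial pair (t0, s), and the pair giving the smallest sum of squared residuals is kept.

```python
import numpy as np

def fit_arctan_trend(t, y, centers, scales):
    """Grid-search t0 and s; solve a, b by least squares for y = a + b*arctan((t - t0)/s)."""
    best = None
    for t0 in centers:
        for s in scales:
            design = np.column_stack([np.ones_like(t), np.arctan((t - t0) / s)])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            sse = float(np.sum((design @ coef - y) ** 2))
            if best is None or sse < best[0]:
                best = (sse, coef[0], coef[1], t0, s)
    return best[1:]

# Made-up declining series that levels off at a positive value.
t = np.arange(15, dtype=float)
y = 60.0 - 30.0 * np.arctan((t - 5.0) / 4.0) + 0.3 * (-1.0) ** t

a, b, t0, s = fit_arctan_trend(t, y,
                               centers=np.arange(0.0, 15.0),
                               scales=np.arange(1.0, 11.0))
slope, intercept = np.polyfit(t, y, deg=1)    # straight line for comparison

t_future = 40.0
line_projection = intercept + slope * t_future
arctan_projection = a + b * np.arctan((t_future - t0) / s)
print("straight line at t = 40:", round(line_projection, 1))
print("arc tangent at t = 40  :", round(arctan_projection, 1))
```

Projected well into the future, the straight line falls below zero, which is the absurd result described above, while the arc-tangent curve levels off at a positive value.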

The orthogonal analysis is a method invented for reducing the 
amount of arithmetical calculation involved in fitting
polynomials to time series by the method of least squares, especially
second- and third-degree polynomials or polynomials of higher 
degree. The fitting of a polynomial of higher than second degree 
to a time series involves laborious calculations, particularly if a 
considerable period of time is covered. This laborsaving method 

¹ Carmichael, F. L., “The Arc Tangent in Trend Determination,”
Journal of the American Statistical Association, Vol. 23 (1928), pp. 253-262.
is described in detail, together with tables of values to facilitate
its use, in Chap. XXII.¹
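
Why orthogonal polynomials save labor can be suggested by a short sketch. It does not use the tables of Chap. XXII; instead it assumes that orthogonal polynomial values for the given time points may be obtained numerically (here by a QR factorization of the powers of t). The series is made up for the illustration.

```python
import numpy as np

t = np.arange(-5, 6, dtype=float)      # 11 equally spaced periods, centered
y = np.array([12., 15., 14., 18., 21., 20., 26., 27., 33., 34., 41.])   # made-up series

vander = np.vander(t, N=4, increasing=True)   # columns 1, t, t^2, t^3
q, _ = np.linalg.qr(vander)                   # orthonormal polynomial values at these t

coef = q.T @ y                      # one independent coefficient for each degree
fit_deg2 = q[:, :3] @ coef[:3]      # second-degree trend
fit_deg3 = q @ coef                 # third-degree trend: one more term is added
```

Because the columns are orthogonal, each coefficient is found separately, and raising the degree of the trend leaves the coefficients already computed unchanged. With ordinary powers of t, the whole set of normal equations would have to be solved again.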

¹ See pp. 600-615. Also cf. Jordan, Charles, “Approximation and
Graduation According to the Principle of Least Squares by Orthogonal
Polynomials,” The Annals of Mathematical Statistics, Vol. 3 (1932), pp. 257-
357. Cf. Romanovsky, V., “Note on Orthogonalizing Series of Functions
and Interpolation,” Biometrika, Vol. 19 (1927), pp. 93-99; Jordan, Charles,
“Sur une série de polynômes dont chaque somme partielle représente la
meilleure approximation d’un degré donné suivant la méthode des moindres
carrés,” Proceedings of the London Mathematical Society, Vol. 20 (1921), pp.
297-325; and Dieulefait, Carlos E., “La determinación de la tendencia
secular en las series económicas,” Gabinete de Estadística, Rosario, Argentine
Republic (Santa Fe), Universidad Nacional del Litoral (1932), pp. 1-52. Cf.
Fisher, R. A., Statistical Methods for Research Workers (4th ed., 1932),
pp. 133-142.
TREND ANALYSIS

Empirical Trend vs. Rational Trend. Both empirical and
rational trends are obtained by analysis from raw data; the 
difference between the two is that a rational trend can be 
explained in terms of long-time growth or decline, whereas an 
empirical trend has no meaning per se. The empirical trend is 
a useful tool of analysis, as will be seen in the ensuing discussion. 

In the preceding chapter the attempt was made to convey the 
idea that a rational trend is one that is found for its own sake; 
it has a rational explanation and is useful as a method of
interpretation in itself. While it may be true that the
rationalization that is made with respect to such trends is preliminary or
even tentative, nevertheless the original intent is to make a 
rational use of them. Empirical trends are those for which there 
is admittedly no rational basis at the start, being used merely as 
a convenient method of removing from the data longer time 
movements that obscure the shorter time cyclical fluctuations. 

Empirical trends in themselves may have no rational
significance as a description of any type of long-time growth, or
movement. An empirical trend calculated for a period of 9 years 
at a point in time coincident with the peak of a secondary secular 
movement would presumably be in the form of a parabola. At 
another point in time, a 9-year trend analysis may give a straight 
line, or a logarithmic line. If a trend line happened to be
calculated for a period of time from the low point of a secondary
secular movement to the high point of another, the empirical 
trend might assume the form of a Verhulst growth curve; but 
it may have no such significance as a law of growth in that case, 
being simply the result of happening to take an empirical trend 
for that period of time. An examination of the heavy black 
curve representing the secondary trends in cement production 
(Fig. 141, page 556) will help to make clear what is meant by these
statements. 
Detecting Cycle by Removing Empirical Trend. While empirical
trends may have no rational significance per se, the fitting of 
an empirical trend to the annual data of a time series will make 
it possible to isolate the residuals from the trend. These 
residuals constitute the cycles and minor cycles of the period 
analyzed. The first clear statement of the analysis of time 
series by this method was made by W. M. Persons in 1915.¹
The method is illustrated by examples at the end of this chapter. 

Thus the function of empirical trend analysis is to obtain an 
approximation to some longer term movement for the purpose of 
eliminating this in order to study shorter term movements of a 
cyclical or accidental character. The empirical trend may
approximate a segment of a long-term cyclical movement, or it 
may approximate a portion of long-term growth in the series 
that might itself have rational explanation. What the empirical 
trend measures depends upon the circumstances in each problem, 
and the discovery of the rational nature of an empirical trend 
depends upon a priori knowledge. 
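
The removal of an empirical trend may be sketched as follows. The annual figures are made up, the 11-year period is arbitrary, and a second-degree polynomial is chosen only as one possible empirical trend; the fitting is done with NumPy's `polyfit` rather than by the hand methods described in this chapter.

```python
import numpy as np

years = np.arange(1920, 1931)          # hypothetical 11-year period
t = years - years.mean()               # time measured from the middle year
series = np.array([52., 47., 55., 63., 60., 66., 71., 69., 74., 83., 72.])  # made-up data

coeffs = np.polyfit(t, series, deg=2)  # empirical second-degree trend
trend = np.polyval(coeffs, t)

residuals = series - trend                   # cyclical and irregular movements
percent_of_trend = 100.0 * series / trend    # the same movements in relative form

for yr, r, p in zip(years, residuals, percent_of_trend):
    print(yr, f"{r:+6.2f}", f"{p:6.1f}")
```

The residuals fluctuate above and below zero and sum to zero over the period fitted. Whether they represent cycles, minor cycles, or merely accidental movements is a matter of interpretation and not of arithmetic.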

Methods of Fitting Trend. Three methods of fitting trend to 
time series can be distinguished: (1) the method of least squares, 
(2) the method of selected points, and (3) the method of 
averages. 

Fitting a Trend Line by the Method of Least Squares. Figure 142
represents a plane in which there are seven points, $P_1, \ldots, P_7$.
To simplify the arithmetic an uneven number of points is taken,
and the middle point is selected for the location of the $y$-axis.²
Accordingly, $t$ varies from $-3$ to $+3$. The coordinates of the
points, as may be observed from the figure, are as follows:

$P_1(t = -3,\ y_1)$  $P_2(t = -2,\ y_2)$  $P_3(t = -1,\ y_3)$

$P_4(t = 0,\ y_4)$  $P_5(t = 1,\ y_5)$  $P_6(t = 2,\ y_6)$  $P_7(t = 3,\ y_7)$

¹ American Economic Review, December, 1916, pp. 739-769; Publications
of the American Statistical Association, June, 1917, pp. 602-642; Harvard
Review of Economic Statistics, Preliminary Vol. 1 (1919). Cf. Mitchell,
W. C., Business Cycles—The Problem Stated and Its Setting, pp. 200, 212-213,
328-330.

² For statistical purposes it is more convenient to take a more recent year
as the time origin than that of the birth of Christ. Thus, if a given set of
data run from 1927, say, to 1937, it might be convenient to choose 1932 as
the zero year. If this were done, then 1933 would be $t = 1$, 1935 would
be $t = 3$, 1929 would be $t = -3$, etc.
The corresponding points on the straight line to be found, for
example, point A in the figure, may be represented by the
following coordinates:

$(t = -3,\ y'_1)$  $(t = -2,\ y'_2)$  $(t = -1,\ y'_3)$

$(t = 0,\ y'_4)$  $(t = 1,\ y'_5)$  $(t = 2,\ y'_6)$  $(t = 3,\ y'_7)$

The general form of the equation for a straight line in a field of
coordinates $y$ and $t$ is $y = a + bt$, and for this line the equation
is as follows:

$y' = a + bt$  (1)

The line is determined for the particular case by finding values
of $a$ and $b$.



The line that is sought is the one from which the sum of the 
squared deviations of the points from the line is less than such a 
sum with respect to any other line. This is the least-squares 
criterion. 

The vertical residuals of particular points from the line are
as follows (one such residual is illustrated in Fig. 142):

$r_1 = y_1 - y'_1$
$r_2 = y_2 - y'_2$
$\ldots$
$r_7 = y_7 - y'_7$

Some of these variations (designated as $r$) are negative, for
example, at $P_7$, while others are positive. When
squared, however, they are all positive, and the condition that
must be satisfied according to the least-squares criterion for a
line that will best fit these points is that $\sum r^2$ = minimum, in
other words, that

$\sum (y - y')^2 = \text{minimum}$  (2)


The value of y', from Eq. (1), may be substituted; Eq. (2) then 
becomes 

$\sum (y - a - bt)^2 = \text{minimum}$  (3)

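The condition under which Eq. (3) is true is that the total
differential is equal to zero, in other words, that

$d\left(\sum r^2\right) = \frac{\partial\left(\sum r^2\right)}{\partial a}\,da + \frac{\partial\left(\sum r^2\right)}{\partial b}\,db = 0$

Inasmuch as $da$ and $db$ cannot be equal to zero, this gives the two
conditions that¹

$\frac{\partial}{\partial a}\sum (y - a - bt)^2 = -2\sum (y - a - bt) = 0$

$\frac{\partial}{\partial b}\sum (y - a - bt)^2 = -2\sum t(y - a - bt) = 0$  (4)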

and hence the following two equations, by canceling out the 2's
and carrying out the summations:

$$\sum y = Na + b \sum t \tag{i}$$
$$\sum ty = a \sum t + b \sum t^2 \tag{ii}$$

In these two equations, all the terms are known except a and
b, because $\sum t = 0$ and $\sum y$ is the sum of the known y's of the
seven points $P_1, \ldots, P_7$. The $\sum t^2$ is

$$9 + 4 + 1 + 0 + 1 + 4 + 9 = 28$$

Because $\sum t = 0$, values for a and b can be found as follows:

$$a = \frac{\sum y}{N} \quad \text{from Eq. (i)}$$
$$b = \frac{\sum ty}{\sum t^2} \quad \text{from Eq. (ii)}$$

¹ In the case under consideration, it is not necessary to be concerned with
the possibility that these same conditions might also hold true for a maximum
or a minimum, since the conditions of the problem indicate that it
is a minimum.

Accordingly, the equation for the line of best fit, by the
criterion of least squares, is as follows:

$$y' = \frac{\sum y}{N} + \frac{\sum ty}{\sum t^2}\, t \tag{5}$$


Numerical Illustration. As a more concrete illustration,
values will be assigned to the y's of the seven points, as follows
(t coordinates remaining as before):

$$P_1(y = 5) \quad P_2(y = 2) \quad P_3(y = 7) \quad P_4(y = 4) \quad P_5(y = 6) \quad P_6(y = 10) \quad P_7(y = 8)$$

An orderly work sheet will be set up in order to find $\sum y$, $\sum ty$,
and $\sum t^2$; N, of course, is equal to 7.


Work Sheet for Finding Best-fitting Straight Line for Seven Given Points

      t         y         ty        t^2
     -3         5        -15         9
     -2         2         -4         4
     -1         7         -7         1
      0         4          0         0
      1         6          6         1
      2        10         20         4
      3         8         24         9

                       -26 + 50
   Σt = 0    Σy = 42   Σty = 24   Σt^2 = 28

The equation for the best-fitting line according to the least-squares
criterion is therefore as follows [see Eq. (5)]:

$$y' = \frac{42}{7} + \frac{24}{28}\, t$$

or

$$y' = 6 + 0.86t$$
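The work-sheet arithmetic for the seven points can also be sketched in a few lines of Python. This is only an illustrative sketch, assuming an odd number of equally spaced observations so that t can be centered on the middle one; the function name `fit_centered_line` is hypothetical and does not come from the text.

```python
def fit_centered_line(ys):
    """Fit y' = a + b*t by least squares, with t centered so that sum(t) = 0.

    Assumption of this sketch: an odd number of equally spaced observations.
    """
    n = len(ys)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of observations")
    half = n // 2
    ts = range(-half, half + 1)                    # -3, ..., 3 for seven points
    sum_y = sum(ys)                                # column y of the work sheet
    sum_ty = sum(t * y for t, y in zip(ts, ys))    # column ty
    sum_t2 = sum(t * t for t in ts)                # column t^2
    a = sum_y / n          # Eq. (i) with sum(t) = 0
    b = sum_ty / sum_t2    # Eq. (ii) with sum(t) = 0
    return a, b


a, b = fit_centered_line([5, 2, 7, 4, 6, 10, 8])
print(a, round(b, 2))  # 6.0 0.86
```

Because the time origin sits at the middle observation, the two normal equations separate and no simultaneous solution is needed.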

It will be well to note what the equation says. First, with
each unit increase of t, the line (that is, the value of y') rises by
0.86. This value, 0.86, is called the "slope" of the line; and it
is the tangent of the angle that the line makes with the t-axis
or with any line parallel to the t-axis. Second, it says that,
when t = 0, y' = 6. This means that the line passes through
the y-axis at a point +6 from the t-axis (when the y-axis is
located at the middle point in time).

If the y-axis were shifted from its present location to the
position t = -3, everything else remaining in its original
position, the value of the t coordinates of all the points P will
change to accord with the new location of the y-axis. Also, it is
to be noted that the above equation would then become

$$y' = [6 - 3(0.86)] + 0.86t$$

or

$$y' = 3.42 + 0.86t$$

since 3.42 will be the intercept on the new y-axis.
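The change of origin affects only the intercept. A minimal sketch, assuming the new origin is placed at the old time value `shift` (so that the new t equals the old t minus `shift`); `shift_origin` is a hypothetical helper name used only for illustration.

```python
def shift_origin(a, b, shift):
    """Rewrite y' = a + b*t for a new origin located at old t = shift.

    With t_new = t_old - shift, y' = (a + b*shift) + b*t_new,
    so the slope is unchanged and only the intercept moves.
    """
    return a + b * shift, b


new_a, new_b = shift_origin(6.0, 0.86, -3)
print(round(new_a, 2), new_b)  # 3.42 0.86
```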



Fitting Second- or Third-degree Curves. Second-, third-, or
even higher-degree curves may similarly be fitted by the method
of least squares. It may happen that the points are distributed
in such a manner that a straight line does not fit. For example,
Fig. 143 shows seven points that would be better fitted by a
parabola. The general form of the equation for such a curve is

$$y' = a + bt + ct^2$$

The equations for finding values of a, b, and c, for such a best-fitting
parabola, are worked out on precisely the same principles
as those for finding a and b for the best-fitting straight line.¹
That is to say, the equation $y' = a + bt + ct^2$ is fitted to the
points so that

$$\sum (y - y')^2 = \text{minimum} \tag{6}$$

and when the value of y' is substituted in this equation, it
becomes

$$\sum (y - a - bt - ct^2)^2 = \text{minimum} \tag{7}$$

¹ For a better method of fitting polynomials by the method of least
squares, see Chap. XXII, Orthogonal Polynomial Trends.

When this expression is differentiated with respect to a, b,
and c, following the same method as in Eqs. (4), (i), and (ii),
the equations for finding a, b, and c are obtained, as follows:

$$\sum y = Na + b \sum t + c \sum t^2 \tag{i}$$
$$\sum ty = a \sum t + b \sum t^2 + c \sum t^3 \tag{ii}$$
$$\sum t^2 y = a \sum t^2 + b \sum t^3 + c \sum t^4 \tag{iii}$$

A work sheet such as the following form (leaving out columns for
the uneven powers of t; their sums will presumably all be zero since
the zero value of t is selected in the middle of an odd number of
years) is used for finding values of a, b, and c.


Work Sheet for Finding Best-fitting Parabola for Seven Given Points

      t         y         ty        t^2       t^2 y      t^4
     ...       ...       ...        ...        ...       ...

   Σt = ...  Σy = ...  Σty = ...  Σt^2 = ...  Σt^2 y = ...  Σt^4 = ...


Since $\sum t = 0$, when the sums of the columns in the work
sheet are substituted in Eqs. (i), (ii), and (iii) above, the three
unknowns a, b, and c may be found by solution of these.
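When the sums of t and of t cubed both vanish, the three normal equations for the parabola separate: the second gives b directly, and the first and third form a pair in a and c. The sketch below works this out. It assumes an odd number of equally spaced observations, and the data are hypothetical, generated exactly from y = 2 + 3t + t^2 so that the result can be checked by eye; `fit_centered_parabola` is an illustrative name only.

```python
def fit_centered_parabola(ys):
    """Fit y' = a + b*t + c*t^2 by least squares, t centered on the middle item."""
    n = len(ys)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of observations")
    half = n // 2
    ts = range(-half, half + 1)
    sum_y = sum(ys)
    sum_ty = sum(t * y for t, y in zip(ts, ys))
    sum_t2y = sum(t * t * y for t, y in zip(ts, ys))
    sum_t2 = sum(t ** 2 for t in ts)
    sum_t4 = sum(t ** 4 for t in ts)
    # With sum(t) = sum(t^3) = 0 the normal equations reduce to:
    #   (i)   sum(y)     = N*a + c*sum(t^2)
    #   (ii)  sum(ty)    = b*sum(t^2)
    #   (iii) sum(t^2 y) = a*sum(t^2) + c*sum(t^4)
    b = sum_ty / sum_t2
    c = (n * sum_t2y - sum_t2 * sum_y) / (n * sum_t4 - sum_t2 ** 2)
    a = (sum_y - c * sum_t2) / n
    return a, b, c


# Hypothetical data lying exactly on y = 2 + 3t + t^2 for t = -3, ..., 3.
ys = [2 + 3 * t + t * t for t in range(-3, 4)]
a, b, c = fit_centered_parabola(ys)
print(a, b, c)  # 2.0 3.0 1.0
```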

Probability Theory Is Not Applied. It must be remembered that
the application of the least-squares criterion for obtaining the
line that best fits a time series does not involve the application
of the theory of least squares in the sense that the trend line
obtained is a most probable line, expressive of a law of movement
or growth in the probability sense.¹ As originally applied,
the theory of least squares had a definite connection with the
theory of probabilities because it was devised as a method of
obtaining a measure of the most probable orbit of a comet, etc.
In the fitting of a trend line to a single time series there is no
multiplicity of cases fluctuating in a normal distribution about the
trend line. The use of the least-squares criterion in trend fitting
for time series is merely the application by analogy of a method
that produces desired results; it gives an objective criterion for
finding the line of best fit. If the analyst can be satisfied with
a less objective method, he may use, for example, the method of
selected points, which will now be described.

¹ Cf. Kuznets, Simon S., Secular Movements in Production and Prices
(1930), p. 62, who cites W. H. R. Lexis, Zur Theorie der Massenerscheinungen in
der menschlichen Gesellschaft (Freiburg i. B., F. Wagner, 1877), pp. 31-33.
See also Tintner, Gerhard, "The Analysis of Economic Time Series,"
Journal of the American Statistical Association, Vol. 35 (1940), pp. 93-100.

Methods of Selected Points. One of the simplest methods
of determining the trend of a time series is to make the trend
"line" pass through certain points selected as representative of
normal values. This line¹ may be drawn in a purely freehand
fashion, or a mathematical equation may be determined such
that it is satisfied by the coordinates of the selected points.

To determine a unique mathematical equation in a given
case the number of selected points must be taken equal to the
number of parameters in the equation. Thus, if a straight-line
trend seems appropriate, two normal years are selected (preferably
near the ends of the series) and the values of a and b in
the equation $y' = a + bt$ are so determined that the equation
is satisfied by the values of t and y for the selected points. If
a parabolic trend of the type $y' = a + bt + ct^2$ is deemed
appropriate, then three normal points must be selected to
determine the values of a, b, and c. In general, if a polynomial
with n parameters (that is, of degree n - 1) is taken to portray the
course of the trend, viz., $y' = a + bt + ct^2 + \cdots + kt^{n-1}$, then
there must be n selected points. The polynomial is the simplest type of
mathematical equation to employ for this purpose. Other, more
"rational" types may also be fitted by this method, however,
and its use in fitting a simple logistic curve is described below.

The actual process of finding the mathematical equation of
the chosen type that is satisfied by the selected points consists
in solving n simultaneous equations, n being the number of
selected points (or the number of parameters to be determined).
Thus if $(t_1, y_1)$ and $(t_2, y_2)$ are the coordinates of the selected
points, the straight line $y' = a + bt$ passing through these
points is given by the solution of the following equations for
a and b:

$$y_1 = a + bt_1$$
$$y_2 = a + bt_2$$

For example, if the time scale is such that $t_1 = 3$ and $t_2 = 9$
and if the y values for these years (or months) are* $y_1 = 68$ and
$y_2 = 110$, then a and b are found by solving the equations

$$68 = a + 3b$$
$$110 = a + 9b$$

These yield a = 47 and b = 7; hence the equation for the given
trend is $y' = 47 + 7t$.

¹ "Line" is here used in the generic sense; it may be either straight or
curved.

If the equation to be fitted is a second-degree parabola
$y' = a + bt + ct^2$ and if $(t_1, y_1)$, $(t_2, y_2)$, and $(t_3, y_3)$ are the
coordinates of the selected points, then a, b, and c are determined
by solving the equations

$$y_1 = a + bt_1 + ct_1^2$$
$$y_2 = a + bt_2 + ct_2^2$$
$$y_3 = a + bt_3 + ct_3^2$$

Three equations are more difficult to solve than two; but if the
time scale is chosen so that $t_1 = 0$, then these reduce to

$$y_1 = a$$
$$y_2 = a + bt_2 + ct_2^2$$
$$y_3 = a + bt_3 + ct_3^2$$

or

$$y_2 - y_1 = bt_2 + ct_2^2$$
$$y_3 - y_1 = bt_3 + ct_3^2$$

and two equations are obtained for determining b and c, the
value of a being $y_1$. For example, if the selected points are

$$(t_1 = 0,\ y_1 = 68) \quad (t_2 = 6,\ y_2 = 110) \quad (t_3 = 12,\ y_3 = 200)$$

then a = 68 and b and c may be found from the solution of the
equations

$$110 - 68 = 6b + 36c$$
$$200 - 68 = 12b + 144c$$

The results are b = 3 and c = 2/3 = 0.67; hence, the parabola
which passes through the given points is

$$y' = 68 + 3t + 0.67t^2, \quad \text{origin at } t_1 = 0$$
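Solving for a polynomial through selected points is a small system of simultaneous linear equations, one per point. The sketch below uses ordinary Gauss-Jordan elimination with exact fractions; this is a technique chosen here for illustration, not one prescribed by the text, and `polynomial_through_points` is a hypothetical name. It assumes the selected t values are all different.

```python
from fractions import Fraction


def polynomial_through_points(points):
    """Return [a, b, c, ...] of the polynomial passing through the given (t, y) points."""
    n = len(points)
    # Augmented matrix: each row is [1, t, t^2, ..., t^(n-1) | y].
    rows = [[Fraction(t) ** k for k in range(n)] + [Fraction(y)] for t, y in points]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


line = polynomial_through_points([(3, 68), (9, 110)])
parabola = polynomial_through_points([(0, 68), (6, 110), (12, 200)])
print([str(c) for c in line])       # ['47', '7']
print([str(c) for c in parabola])   # ['68', '3', '2/3']
```

Exact fractions are used so that c comes out as 2/3 rather than a rounded decimal.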

When higher degree polynomials are fitted in this way, the
simultaneous equations may be solved by repeated substitution,
or special methods making use of finite differences may be
employed.¹

* These values may be actual values or values estimated as normal.

Method of Averages. Even less refined methods of fitting
lines to data than those already described could be applied; in
fact, the analyst could, if he so desired, merely draw the line
that seems to fit the plotted data. The objection to this method
is that it is too subjective; no two people would draw the same
line. A certain degree of objectivity is secured by applying the
method of selected points, which has already been described, or
by using a modification of that method, namely, the method of
averages. The method of averages merely suggests a refinement
in the selection of the points. It can be illustrated by the
fitting of a straight line, but it could be applied to curves as well.


Work Sheet for Fitting a Straight-line Trend by the Method of Averages

      t       y
      1       5
      2       2
      3       7     For t = 3, y is taken as the average of the first five
      4       4     y's; that is, y_3 = 5.
      5       7
      6       8
      7      15
      8      19     For t = 8, y is taken as the average of the last five
      9      18     y's; that is, y_8 = 15.
     10      15

   t_3 = 3, y_3 = 5
   t_8 = 8, y_8 = 15


The trend line is the straight line passing through the two
points t = 3, y' = 5 and t = 8, y' = 15. Following the same
procedure as that used in the method of selected points, the
parameters a and b are found by solving the following two
equations:

$$5 = a + 3b$$
$$15 = a + 8b$$

from which it is found that b = 2 and a = -1, so that the
trend line is $y' = -1 + 2t$.
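The method of averages amounts to replacing each half of the series by a single averaged point and passing a line through the two. A minimal sketch follows, assuming the series is simply split into a first and a second half; the function name `line_by_method_of_averages` is hypothetical.

```python
def line_by_method_of_averages(ts, ys):
    """Line through the averaged point of each half of the series."""
    half = len(ys) // 2
    t_first = sum(ts[:half]) / half
    y_first = sum(ys[:half]) / half
    t_last = sum(ts[half:]) / len(ts[half:])
    y_last = sum(ys[half:]) / len(ys[half:])
    b = (y_last - y_first) / (t_last - t_first)   # slope between the two averaged points
    a = y_first - b * t_first                     # intercept at t = 0
    return a, b


ts = list(range(1, 11))
ys = [5, 2, 7, 4, 7, 8, 15, 19, 18, 15]
a, b = line_by_method_of_averages(ts, ys)
print(a, b)  # -1.0 2.0
```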

Method of Moving Averages. Ordinarily the method of moving
averages is used with monthly data, but it could be used with
annual data if an appropriate number of years over which to
average or smooth the data could be determined. The difficulty
of determining the proper number of years for the averaging
period is one of the objections to this method; another objection
is that it does not give an equation of trend. The method of
moving averages is explained in Chap. XXIII, Seasonal Variation.

¹ For the latter, the reader is referred to E. T. Whittaker and G. Robinson,
The Calculus of Observations (1924), Chap. I.

Advantages of the Method of Least Squares. The advantage
of using the least-squares line is that it gives a line from which the
residuals add up to zero and when squared are a minimum; this
supplies an objective criterion for the fit of the line. In addition,
the least-squares method of trend fitting is a very flexible device
that can be widely applied and varied according to the type
of line desired. If a complex trend line is desired, a mathematical
procedure based upon the least-squares criterion is
handily available. The method of orthogonal polynomials
explained in the next chapter, for example, is an application of
the method of least squares.

ILLUSTRATIONS OF RATIONAL TRENDS 

As indicated in the preceding chapter, rational trends are
likely to be logistic in character. The simplest type of logistic
curve is of the form $y = ab^t$, which may readily be reduced to a
straight line if the equation is expressed in logarithms, as follows:
$\log y = \log a + t \log b$.

Trend of a Dying Institution. If the early development,
growth, and arrival at maturity of a new economic institution
follow the pattern suggested by Raymond B. Prescott, as
explained in the preceding chapter, presumably the disappearance
of a dying institution would follow a reversal of that pattern.
Thus, it would die slowly at first, then rapidly, and then slowly
again until it finally disappeared. If such is the case, the
appropriate equation to use is one of the Verhulst, Pearl-Reed,
or Gompertz types of curves. However, an economic institution
that is disappearing from the economic system might depart in
another manner; it might be struck a sudden devastating blow
by a new development that caused it to die or decline according
to the simple logistic curve $y = ab^t$. Such appears to be the
case with respect to a certain type of commercial bank credit
known as "open-market commercial paper." Many authorities
on money and credit believe this to be a dying institution
in this country; and accordingly the downward trend illustrated
in Table 81 and Fig. 144 may be considered a rational trend.¹
The data used are annual average monthly volumes of open-market
commercial paper outstanding; and Table 81 is a work
sheet for calculating the straight-line logarithmic trend line for
these data, following the method indicated on pages 566 to 568.
Here, however, the straight line is fitted to the logarithms of the
data instead of to the data themselves.

The equation for this trend line is $y' = ab^t$, so that, by the
rule of logarithms,

$$\log y' = \log a + t \log b$$

The two least-squares equations that would be obtained by the
method explained above² are as follows:

$$\sum \log y = N \log a + \log b \sum t$$
$$\sum t \log y = \log a \sum t + \log b \sum t^2$$

Upon substituting the sums taken from the appropriate columns
of Table 81, this gives

$$36.18035 = 23 \log a$$
$$-38.41844 = 1{,}012 \log b$$

from which

$$\log a = 1.57306 \quad \text{and} \quad \log b = -0.037963$$

Therefore, the equation of the best-fitting (according to the
least-squares criterion) logarithmic trend in this case is

$$\log y' = 1.57306 - 0.037963t$$
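Fitting the logarithmic straight line is the same centered least-squares computation applied to log y instead of y. The sketch below assumes an odd number of equally spaced observations and common (base 10) logarithms; the function name `fit_log_trend` and the test series are hypothetical. The last line simply repeats the two divisions of the Table 81 column totals quoted in the text.

```python
import math


def fit_log_trend(ys):
    """Fit log10(y') = log_a + t*log_b by least squares, t centered on the middle item."""
    n = len(ys)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of observations")
    half = n // 2
    ts = range(-half, half + 1)
    logs = [math.log10(y) for y in ys]
    log_a = sum(logs) / n                                             # sum(log y) / N
    log_b = sum(t * v for t, v in zip(ts, logs)) / sum(t * t for t in ts)
    return log_a, log_b


# Hypothetical series declining by exactly 10 per cent per period.
ys = [100 * 0.9 ** t for t in range(-3, 4)]
log_a, log_b = fit_log_trend(ys)
print(round(10 ** log_a, 6), round(10 ** log_b, 6))  # 100.0 0.9

# The column totals quoted from Table 81, divided as in the text.
print(round(36.18035 / 23, 5), round(-38.41844 / 1012, 6))  # 1.57306 -0.037963
```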

¹ For explanations of the demise of open-market commercial paper see
B. H. Beckhart, The New York Money Market, Vol. 3, pp. 242-246; O. A.
Greef, The Commercial Paper House in the United States (1938), pp. 123-127;
P. Hunt, Portfolio Policies of Banks in the United States 1920-1929 (1940),
pp. 11-38.

² See pp. 566-567.

Table 81. Work Sheet for Calculating Annual Index of Normal and Trend
Straight-line logarithmic trend
Data: Open-market commercial paper outstanding. Annual averages of monthly data
(In millions of dollars)
Equation of trend: log y' = 1.57306 - 0.037963t

   Year   Raw data    log y       t     t log y      t^2    Log of trend   Trend   Index of computed
             y                                                 log y'        y'     trend, y/y'
                                -13
                                -12
   1919    1,084     2.03503    -11   -22.38533     121      1.99065        979       110.7
   1920    1,113     2.04650    -10   -20.46500     100      1.95269        897       124.1
   1921      749     1.87448     -9   -16.87032      81      1.91473        822        91.1
   1922      768     1.88536     -8   -15.08288      64      1.87676        753       102.0
   1923      834     1.92117     -7   -13.44819      49      1.83880        690       120.9
   1924      873     1.94101     -6   -11.64606      36      1.80084        632       138.1
   1925      743     1.87099     -5    -9.35495      25      1.76288        579       128.3
   1926      629     1.79865     -4    -7.19460      16      1.72491        531       118.4
   1927      585     1.76716     -3    -5.30148       9      1.68695        486       120.4
   1928      494     1.69373     -2    -3.38746       4      1.64899        446       110.8
   1929      322     1.50786     -1    -1.50786       1      1.61102        408        78.9
   1930      489     1.68931      0     0             0      1.57306        374       130.7
   1931      264     1.42160      1     1.42160       1      1.53510        343        77.0
   1932      106     1.02531      2     2.05062       4      1.49713        314        33.8
   1933       95     0.97772      3     2.93316       9      1.45917        288        33.0
   1934      156     1.19312      4     4.77248      16      1.42121        264        59.1
   1935      174     1.24055      5     6.20275      25      1.38324        242        71.9
   1936      188     1.27416      6     7.64496      36      1.34528        222        84.7
   1937      296     1.47129      7    10.29903      49      1.30732        203       145.8
   1938      239     1.37840      8    11.02720      64      1.26936        186       128.5
   1939      198     1.29667      9    11.67003      81      1.23139        170       116.5
   1940      234     1.36922     10    13.69220     100      1.19343        156       150.0
   1941      317     1.50106     11    16.51166     121      1.15547        143       221.7
                                 12
                                 13

   N = 23           36.18035          -38.41844    1,012                            2,496.4
                    Σ log y           Σ t log y     Σt^2                             Σ(y/y')

Source: Compiled from the Annual Report of the Federal Reserve Board, 1929, p. 121;
1935, p. 174; and from the Survey of Current Business, Annual Supplement (Vol. 20, 1940),
p. 47.

When a logarithmic straight line is fitted to a time series by
the method of least squares, it is the sum of the squares of the
ratio residuals that is made a minimum, and not the sum of the
squares of the actual residuals as is the case where an arithmetical
straight line is fitted.¹ It is the following expression
that is minimized:
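$$\sum (\log y - \log y')^2$$

which is the same as

$$\sum \left( \log \frac{y}{y'} \right)^2$$

If the logarithm is expanded in a power series, this sum is seen
to be roughly equivalent (apart from a constant factor) to

$$\sum \left( \frac{y - y'}{y'} \right)^2$$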




Fig. 144.—Open-market commercial paper outstanding in the United States, 
1919-1941. Logistic trend fitted by method of least squares. 

For a dying institution, open-market commercial paper outstanding
showed remarkable vigor in the years 1933-1941, and
perhaps the monetary economists were premature in their
predictions. Whether or not they were is a matter for the
future to reveal.

¹ Cf. pp. 566-567.

Trend of a Growing Institution. Method of Selected Points
Illustrated. If the hypothesis made by Raymond B. Prescott
can be demonstrated or illustrated in real life, it should surely be
done by the development of the automobile during the past
three or four decades. Table 82 and Fig. 145 give an illustration
of the fitting of a rational trend that purports to represent this
type of growth, constituting thereby a test of this hypothesis.¹
They also illustrate the method of fitting a logistic curve of the
Pearl-Reed type by using selected points.

The equation of the curve may be written in the form

$$y = \frac{k}{1 + m} \tag{8}$$

in which $m = e^{a + bt}$.

It is thus required to find three parameters k, a, and b, which
is more conveniently done by first converting the equation into
logarithms, as indicated in the work sheet.

By using annual data, consisting of monthly average output
of passenger cars and trucks each year from 1903 to 1941, a
graph was made and from its examination the following selected
points were adopted:

   Year      1909          1922          1935
   t         t_0 = 0       t_1 = 13      t_2 = 26
   y         y_0 = 10      y_1 = 250     y_2 = 320

The values of the parameters k, a, and b may be found by using
the following equations:²

$$k = \frac{2 y_0 y_1 y_2 - y_1^2 (y_0 + y_2)}{y_0 y_2 - y_1^2}$$
$$a = \log_e \frac{k - y_0}{y_0} \tag{9}$$
$$b = \frac{1}{n} \log_e \frac{y_0 (k - y_1)}{y_1 (k - y_0)}$$

in which n is defined as $t_2 - t_1$.

¹ Explained on pp. 553-555.

² Cf. Pearl, Raymond, Studies in Human Biology (1924), Chap. XXIV;
The Biology of Population Growth (1925), p. 22. Citations used from
F. E. Croxton and D. J. Cowden, Applied General Statistics (1939), pp. 444-445,
852-853.

Thus, for the problem illustrated,

$$k = \frac{2 \times 10 \times 250 \times 320 - 250 \times 250 \times 330}{10 \times 320 - 250 \times 250} = \frac{5(128 - 1{,}650)}{1.28 - 25} = \frac{-7{,}610}{-23.72} = 320.82630$$

$$a = \log_e \frac{320.82630 - 10}{10} = \log_e 31.082630 = 2.302585 \log_{10} 31.082630 = 2.302585\,(1.4925178) = 3.4366491$$

$$b = \frac{1}{13} \log_e \frac{10(320.82630 - 250)}{250(320.82630 - 10)} = \frac{2.302585}{13} \log_{10} \frac{70.82630}{7{,}770.6575} = 0.1771219\,(\log_{10} 0.00911458)$$

$$[\log_{10} 0.00911458 = 7.9597368 - 10]$$

$$b = 0.1771219\,(-2.0402632) = -0.3613753$$
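The three-point computation of k, a, and b can be checked numerically. The sketch below assumes the curve y = k / (1 + e^(a + bt)) with three selected points equally spaced n periods apart and t = 0 at the first of them; the function names are hypothetical. Natural logarithms are taken directly, rather than through common logarithms as in the hand computation, so the results agree with the text only to the accuracy of the logarithm tables used there.

```python
import math


def pearl_reed_from_three_points(y0, y1, y2, n):
    """Three-point fit of y = k / (1 + exp(a + b*t)), points at t = 0, n, 2n."""
    k = (2 * y0 * y1 * y2 - y1 ** 2 * (y0 + y2)) / (y0 * y2 - y1 ** 2)
    a = math.log((k - y0) / y0)
    b = math.log(y0 * (k - y1) / (y1 * (k - y0))) / n
    return k, a, b


def logistic_trend(k, a, b, t):
    """Trend value for time t."""
    return k / (1 + math.exp(a + b * t))


k, a, b = pearl_reed_from_three_points(10, 250, 320, 13)
print(round(k, 4), round(a, 4), round(b, 4))  # 320.8263 3.4366 -0.3614
# The fitted curve passes through the three selected points.
print([round(logistic_trend(k, a, b, t), 3) for t in (0, 13, 26)])  # [10.0, 250.0, 320.0]
```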

As indicated in Table 82, the values for m for various values of
t are conveniently found by the use of logarithms; thus, since

$$m = e^{a + bt}$$

$$\log m = \log_{10} e \,(a + bt) = 0.43429(a + bt) \qquad [\text{since } \log_{10} e = 0.43429]$$

or, for the example illustrated,

$$\log m = 0.43429(3.4366491 - 0.3613753t) = 1.4925023 - 0.1569417t$$

For the year 1909, when t = 0, the value of log m is 1.4925023,
as may be seen from the work sheet (Table 82), and the values
of log m for other values of t are obtained by the successive
algebraic subtraction of the constant -0.1569417 through the
years preceding 1909 and by successive algebraic addition of the
constant -0.1569417 through the years subsequent to 1909.
These are the logarithms of m for the various values of t, that
is, for the various years. In the next column of the work sheet,
the antilogarithms are entered, which, when added to 1, are
divided into the constant k in order to find the trend values for
each year. These steps are shown in the next three columns of
Table 82. An index of normal, that is, y/y', is also calculated.

Table 82. Work Sheet for Calculating Index of Normal and Trend
Logistic trend of the Pearl-Reed type
Data: Automobile production in the United States. Annual averages of monthly data
(In thousands of cars)

   Year     t      log m          m            1 + m       k/(1 + m)*      y       y/y'
   1903    -6    2.4341525    271.7393      272.7393         1.176        0.9      76.5
   1904    -5    2.2772108    189.3261      190.3261         1.686        1.9     112.7
   1905    -4    2.1202691    131.9073      132.9073         2.414        2.1      87.0
   1906    -3    1.9633274     91.90234      92.90234        3.453        2.8      81.1
   1907    -2    1.8063857     64.03030      65.0303         4.933        3.7      75.0
   1908    -1    1.6494440     44.61122      45.61122        7.034        5.4      75.8
   1909     0    1.4925023     31.08152      32.08152       10.000       10.9     109.0
   1910     1    1.3355606     21.65515      22.65515       14.161       15.6     110.2
   1911     2    1.1786189     15.08757      16.0876        19.942       17.5      87.8
   1912     3    1.0216772     10.51177      11.5118        27.869       31.5     113.0
   1913     4    0.8647355      7.32380       8.3238        38.543       40.4     104.8
   1914     5    0.7077938      5.10263       6.1026        52.572       47.4      90.2
   1915     6    0.5508521      3.55510       4.5551        70.432       80.8     114.7
   1916     7    0.3939104      2.47691       3.4769        92.273      134.8     146.1
   1917     8    0.2369687      1.72571       2.7257       117.704      156.2     132.7
   1918     9    0.0800270      1.20234       2.2023       145.675       97.6      67.0
   1919    10   -0.0769147      0.83769       1.8377       174.581      161.1      92.3
   1920    11   -0.2338564      0.58364       1.5836       202.588      185.6      91.6
   1921    12   -0.3907981      0.40663       1.4066       228.082      134.7      59.0
   1922    13   -0.5477398      0.28331       1.2833       249.999      212.0      84.8
   1923    14   -0.7046815      0.19739       1.1974       267.938      336.2     125.5
   1924    15   -0.8616232      0.13752       1.1375       282.040      300.2     106.4
   1925    16   -1.0185649      0.095815      1.0958       292.773      355.5     121.4
   1926    17   -1.1755066      0.066756      1.0668       300.748      358.4     119.2
   1927    18   -1.3324483      0.046511      1.0465       306.568      283.4      92.4
   1928    19   -1.4893900      0.032405      1.0324       310.758      363.2     116.9
   1929    20   -1.6463317      0.022577      1.0226       313.742      446.5     142.3
   1930    21   -1.8032734      0.015730      1.0157       315.858      279.7      88.6
   1931    22   -1.9602151      0.010959      1.0110       317.348      199.1      62.7
   1932    23   -2.1171568      0.007636      1.0076       318.394      114.2      35.9
   1933    24   -2.2740985      0.0053199     1.0053       319.128      160.0      50.1
   1934    25   -2.4310402      0.0037065     1.0037       319.640      229.4      71.8
   1935    26   -2.5879819      0.0025824     1.00258      320.001      328.9     102.8
   1936    27   -2.7449236      0.0017992     1.00180      320.250      371.2     115.9
   1937    28   -2.9018653      0.0012535     1.00125      320.426      400.7     125.0
   1938    29   -3.0588070      0.0008734     1.00087      320.547      207.4      64.7
   1939    30   -3.2157487      0.0006085     1.00061      320.631      298.1      93.0
   1940    31   -3.3726904      0.0004239     1.00042      320.692      372.4     116.1
   1941    32   -3.5296321      0.0002954     1.00030      320.730      403.2     125.7

                                                                      Σ(y/y') = 3,788.71†

Source: Data from Statistical Abstract of the United States, 1933, p. 334; and Survey of
Current Business, Annual Supplement, Vol. 12 (1932), and current issues, passim.

* k = 320.82630. See p. 579.

† If the curve had been fitted according to the least-squares criterion, this sum would
approximate a hundred times the number of years, that is, 3,900.

The closeness of fit of the trend is attested to, not only by the plotting of the
curve with the data in Fig. 145, but also by the fact that the
sum of the ratios of the raw data to the trend equals approximately
a hundred times the number of years.

ILLUSTRATIONS OF EMPIRICAL TRENDS

The distinction between rational trends and empirical trends 
lies, not in the method of calculation, but in the interpretation 
and analytical use of the trend after it is calculated. Yet, in the 
case of empirical trends, it frequently suffices to fit a trend line 
of very simple character. Thus a straight line may be quite 
adequate in some cases. 

Straight-line Trend. Table 83 contains a work sheet for
calculating a straight-line trend in open-market commercial
paper outstanding for the period 1931-1941. Rationalization
of this trend is uncertain: it may be the commencement of a new
period of growth in what was supposed to be a dying institution,
or it may be merely a cyclical movement. At any rate, for the
period of 11 years selected the trend analysis makes possible a
better study of the shorter term cyclical or residual movements in
the data.

Table 83. Work Sheet for Calculating Index of Normal and Trend
Straight line
Data: Open-market commercial paper outstanding. Annual averages of monthly data
(In millions of dollars)
Equation of trend: y' = 206 + 12.49t (origin at 1936)
Source: Annual Report of the Federal Reserve Board, 1929, p. 121;
1935, p. 174. Survey of Current Business, Annual Supplement, Vol. 20
(1940), p. 47 and current issues, passim.

    (1)      (2)       (3)     (4)      (5)       (6)        (7)
   Year   Raw data      t      t^2      ty       Trend    Index of computed
             y                                     y'       trend, y/y'
   1931     264        -5      25     -1,320      144         183.3
   1932     106        -4      16       -424      156          67.9
   1933      95        -3       9       -285      168          56.5
   1934     156        -2       4       -312      181          86.2
   1935     174        -1       1       -174      194          89.7
   1936     188         0       0          0      206          91.3
   1937     296         1       1        296      218         135.8
   1938     239         2       4        478      231         103.5
   1939     198         3       9        594      243          81.5
   1940     234         4      16        936      256          91.4
   1941     317         5      25      1,585      268         118.3

   N = 11  2,267              110      1,374                1,105.4*
             Σy               Σt^2      Σty                  Σ(y/y')

* This total is a cross check on the work sheet; it should equal a hundred times the number
of years. Failure to check precisely is due to rounding.

The work sheet contains all the information necessary to
calculate the equation of trend, which in this case is of the simple
form $y' = a + bt$. As seen in Eq. (5), the equation is found
by the following:

$$y' = \frac{\sum y}{N} + \frac{\sum ty}{\sum t^2}\, t$$

From the work sheet, for this particular problem,

$$\sum y = 2{,}267 \qquad \sum t = 0 \qquad \sum t^2 = 110 \qquad \sum ty = 1{,}374 \qquad N = 11$$

Accordingly,

$$a = \frac{2{,}267}{11} = 206$$
$$b = \frac{1{,}374}{110} = 12.49$$

and the equation of trend is $y' = 206 + 12.49t$ (origin at 1936).
It is necessary to specify the origin in order to know for which
year t = 0. If the origin were 1931, the equation would be
$y' = 144 + 12.49t$ (origin at 1931); this equation describes the
same straight line as $y' = 206 + 12.49t$ (origin at 1936).

Column (6) of the work sheet contains the solutions of the
trend equation for the respective values of t. Thus, for 1933,
t = -3, and the solution of the trend equation for that year is
y' = 206 + (-3)(12.49) = 168.

Column (7) is the index of computed trend, each y of the raw
data divided by the corresponding y' of the trend, and the result
expressed as a percentage. Thus, 264 is 183.3 per cent of 144.
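The trend and index columns of such a work sheet can be reproduced as follows. This is a sketch only: the yearly values are those listed for open-market commercial paper in 1931-1941 (the 1932 figure, 106, is the one given in Table 81), and the trend values are left unrounded, so the index figures differ slightly from those obtained with trend values rounded to whole numbers.

```python
years = list(range(1931, 1942))
ys = [264, 106, 95, 156, 174, 188, 296, 239, 198, 234, 317]

ts = [year - 1936 for year in years]           # origin at 1936, so sum(t) = 0
n = len(ys)
a = sum(ys) / n                                # sum(y) / N
b = sum(t * y for t, y in zip(ts, ys)) / sum(t * t for t in ts)   # sum(ty) / sum(t^2)

trend = [a + b * t for t in ts]                # trend value y' for each year
index = [100 * y / y_trend for y, y_trend in zip(ys, trend)]      # y as per cent of y'

print(round(a, 2), round(b, 2))                # 206.09 12.49
# Cross check: the index total should lie near 100 times the number of years.
print(round(sum(index), 1), 100 * n)
```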

Polynomial Empirical Trends. Laborsaving Devices. It is
possible to find a second-degree, third-degree, or higher degree
polynomial trend by the methods already illustrated. To fit a
second-degree polynomial, according to the least-squares criterion,
the work sheet would be like that illustrated in skeleton form
on page 570. But it is better, for practical use, to introduce two
important sets of laborsaving devices before proceeding to fit
the higher order polynomial trends. The first set of laborsaving
devices has to do with economy of calculation in the work sheet;
the second set has to do with solving the equations for different
values of t, therefore, with computing trend values for various
years.

Economy of Calculation in the Work Sheet. As already noted,
an economy was obtained by taking an odd number of years and
making the median year the origin, so that $\sum t = 0$, $\sum t^3 = 0$, and
similarly the total of all odd powers of t will be equal to zero;
hence, columns in the work sheet for odd powers of t are not
required. In addition, the entry of columns in the work sheet
for the even powers of t may be avoided because $\sum t^2$, $\sum t^4$,
etc., can be computed from formulas. It can be shown by
algebraic derivation¹ that, if t runs integrally from $t = \pm 1$ to
$t = \pm (n - 1)$, in which $n = \frac{N + 1}{2}$, i.e., n = t (terminal value) + 1,

$$\sum t^2 = \frac{n(n - 1)(2n - 1)}{3}$$
$$\sum t^4 = \sum t^2 \cdot \frac{3n^2 - 3n - 1}{5} \tag{10}$$
$$\sum t^6 = \sum t^2 \cdot \frac{3n^2(n^2 - 2n) + 3n + 1}{7}$$

By similar algebraic computation $\sum t^8$ can be evaluated in terms
of n, but it is preferable to use orthogonal polynomials if a trend
equation of fourth or higher degree is sought.
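Closed forms for the sums of even powers of t can be checked against direct summation. The sketch below uses the standard closed forms for t running over the integers from -(n - 1) to n - 1; these are stated here as an assumption of the sketch rather than quoted from the text, and the function names are hypothetical.

```python
def even_power_sums(n):
    """Closed forms for sum(t^2), sum(t^4), sum(t^6) over t = -(n-1), ..., n-1."""
    s2 = n * (n - 1) * (2 * n - 1) // 3
    s4 = s2 * (3 * n * n - 3 * n - 1) // 5
    s6 = s2 * (3 * n * n * (n * n - 2 * n) + 3 * n + 1) // 7
    return s2, s4, s6


def even_power_sums_direct(n):
    """The same three sums obtained by adding the powers term by term."""
    ts = range(-(n - 1), n)
    return tuple(sum(t ** p for t in ts) for p in (2, 4, 6))


for number_of_years in (7, 11, 23):
    n = (number_of_years + 1) // 2
    assert even_power_sums(n) == even_power_sums_direct(n)

print(even_power_sums(4))  # (28, 196, 1588)
```

For seven years (n = 4) the first value, 28, is the sum of squares found on the straight-line work sheet.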

A second economy for the work sheet is secured by using a
subtotal summation procedure by which aggregates $S_1$, $S_2$, $S_3$,
$S_4$, etc., are obtained. From these aggregates algebraic formulas
are used to compute the required sums as follows:

$$\sum y = S_1$$
$$\sum ty = nS_1 - S_2$$
$$\sum t^2 y = n^2 S_1 - (2n + 1)S_2 + 2S_3 \tag{11}$$
$$\sum t^3 y = n^3 S_1 - (3n^2 + 3n + 1)S_2 + 6(n + 1)S_3 - 6S_4$$

in which $n = \frac{N + 1}{2}$.

¹ Cf. Ross, Frank A., "Formulae for Facilitating Computation in Time
Series Analysis," Journal of the American Statistical Association, Vol. 20
(1925), pp. 75-79. For method of proof see footnote 1, p. 586.

Table 84. Economical Work Sheet for Calculating Polynomial
Trend, Algebraic Illustration
Method of least squares

   Year          Data    Sets of subtotals
           T    t    y     First                         Second                                  Third
   (1)    (2)  (3)  (4)    (5)                           (6)                                     (7)
   1937    1   -2   y_1    y_1                           y_1                                     y_1
   1938    2   -1   y_2    y_1 + y_2                     2y_1 + y_2                              3y_1 + y_2
   1939    3    0   y_3    y_1 + y_2 + y_3               3y_1 + 2y_2 + y_3                       6y_1 + 3y_2 + y_3
   1940    4    1   y_4    y_1 + y_2 + y_3 + y_4         4y_1 + 3y_2 + 2y_3 + y_4                10y_1 + 6y_2 + 3y_3 + y_4
   1941    5    2   y_5    y_1 + y_2 + y_3 + y_4 + y_5   5y_1 + 4y_2 + 3y_3 + 2y_4 + y_5         15y_1 + 10y_2 + 6y_3 + 3y_4 + y_5

                    S_1    S_2                           S_3                                     S_4

The subtotal summation process is illustrated algebraically in 
Table 84 and arithmetically in Table 85. The sum of column 
(4), $S_1$, is merely $\sum y$. Column (5) contains the first set of 
subtotals, which is obtained on the adding machine by taking a 
subtotal after entry of each item in column (4); the first subtotal 
in column (5) will thus be the first item of column (4), that is, 
$y_1$; the second subtotal in column (5) will be $y_1 + y_2$; the third 
subtotal will be $y_1 + y_2 + y_3$; etc. $S_2$ is the sum of these 
subtotals. 

The second set of subtotals, column (6), consists of subtotals 
of the figures in the preceding column, column (5); thus the first 
subtotal in column (6) is $y_1$, the second subtotal is $2y_1 + y_2$, the 
third subtotal is $3y_1 + 2y_2 + y_3$, etc. $S_3$ is the sum of the 
second set of subtotals. 

The third set of subtotals, column (7), consists of the subtotals 
of figures in column (6); and $S_4$ is the sum of this third set of 
subtotals. 
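
The subtotal procedure and Eqs. (11) can be sketched in a few lines. This is an illustrative sketch only: the function names are assumptions, running subtotals are taken with `itertools.accumulate` in place of the adding machine, and the data are the five y values of the arithmetical illustration (2, 5, 8, 7, 9).

```python
from itertools import accumulate


def subtotal_sums(y, sets=3):
    """Return [S1, S2, ..., S(sets + 1)] by repeated running subtotals."""
    column = list(y)
    sums = [sum(column)]                   # S1 is the plain total of y
    for _ in range(sets):
        column = list(accumulate(column))  # next set of running subtotals
        sums.append(sum(column))           # its total gives S2, S3, S4, ...
    return sums


def moments_from_subtotals(S1, S2, S3, S4, N):
    """Eqs. (11): sums of y, t*y, t^2*y, t^3*y with n = (N + 1) / 2."""
    n = (N + 1) / 2
    sum_y = S1
    sum_ty = n * S1 - S2
    sum_t2y = n**2 * S1 - (2 * n + 1) * S2 + 2 * S3
    sum_t3y = n**3 * S1 - (3 * n**2 + 3 * n + 1) * S2 + 6 * (n + 1) * S3 - 6 * S4
    return sum_y, sum_ty, sum_t2y, sum_t3y


y = [2, 5, 8, 7, 9]
S = subtotal_sums(y)
print(S)
print(moments_from_subtotals(*S, N=len(y)))
```

The second printed line can be compared with the sums obtained directly from t = -2, ..., +2, which is the purpose of the work sheet: no products of t and y need be formed.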

This process of taking subtotals and aggregating the subtotals 
by columns to obtain $S_2$, $S_3$, $S_4$ can be repeated as many times as 
desired, depending on the degree of the polynomial that is to be 
fitted. If carried as far as $S_4$, a third-degree polynomial can be 
fitted. 

A cross check on the work sheet is noted in Table 85: $S_1$ is 
equal to the final subtotal in column (5), $S_2$ is equal to the final 
subtotal in column (6), $S_3$ is equal to the final subtotal in column 
(7), etc. 

Table 85. Economical Work Sheet for Calculating Polynomial 
Trend, Arithmetical Illustration 
(Method of least squares) 

                         Sets of subtotals
Year   T    t    y     First   Second   Third
(1)   (2)  (3)  (4)     (5)     (6)      (7)
1937   1   -2    2       2       2        2
1938   2   -1    5       7       9       11
1939   3    0    8      15      24       35
1940   4    1    7      22      46       81
1941   5    2    9      31      77      158
Total           31      77     158      287
                S1      S2      S3       S4

From Table 84 it can be readily seen that algebraically 

$$
\begin{aligned}
S_1 &= y_1 + y_2 + y_3 + \cdots + y_N \\
S_2 &= Ny_1 + (N - 1)y_2 + (N - 2)y_3 + \cdots + y_N \\
S_3 &= \frac{N(N + 1)}{2} y_1 + \frac{(N - 1)N}{2} y_2 + \frac{(N - 2)(N - 1)}{2} y_3 + \cdots + y_N
\end{aligned}
\qquad (12)
$$

For the coefficient of $y_1$ in this sum is equal to the sum of the natural 
numbers from 1 to N, therefore, $\sum_{T=1}^{N} T$, which equals $\frac{N(N + 1)}{2}$; the 
coefficient of $y_2$ is the sum of the natural numbers from 1 to N - 1, which 
equals $\frac{(N - 1)N}{2}$; etc.¹ 

$$
S_4 = \frac{N(N + 1)(N + 2)}{6} y_1 + \frac{(N - 1)N(N + 1)}{6} y_2 + \cdots + y_N
$$

For the coefficient of $y_1$ is the sum of $\frac{N(N + 1)}{2}$ as N goes from 1 to N; 
this sum equals $\frac{N(N + 1)(N + 2)}{6}$; etc. 

¹ This may be demonstrated as follows: 
$\sum T = 1 + 2 + 3 + 4 + \cdots + N$ 
Also, 
$\sum T = N + (N - 1) + (N - 2) + (N - 3) + \cdots + 1$ 
By adding, 
$2\sum T = (N + 1) + (N + 1) + (N + 1) + (N + 1) + \cdots + (N + 1) = N(N + 1)$ 
and hence 
$\sum T = \frac{N(N + 1)}{2}$ 

It will be convenient to express these sums in the following 
manner: 

$$
\begin{aligned}
S_1 &= \sum y \\
S_2 &= \sum (N + 1 - T)\, y \\
S_3 &= \sum \frac{(N + 1 - T)(N + 2 - T)}{2}\, y \\
S_4 &= \sum \frac{(N + 1 - T)(N + 2 - T)(N + 3 - T)}{6}\, y
\end{aligned}
\qquad (13)
$$

In $S_2$: in the case of $y_1$, $N + 1 - T = N$; in the case of $y_2$, 
$N + 1 - T = N - 1$; in the case of $y_3$, $N + 1 - T = N - 2$; etc. 

In $S_3$: in the case of $y_1$, $\frac{(N + 1 - T)(N + 2 - T)}{2} = \frac{N(N + 1)}{2}$; in 
the case of $y_2$, $\frac{(N + 1 - T)(N + 2 - T)}{2} = \frac{(N - 1)N}{2}$; etc. 

In $S_4$: in the case of $y_1$, 
$\frac{(N + 1 - T)(N + 2 - T)(N + 3 - T)}{6} = \frac{N(N + 1)(N + 2)}{6}$; 
in the case of $y_2$, 
$\frac{(N + 1 - T)(N + 2 - T)(N + 3 - T)}{6} = \frac{(N - 1)N(N + 1)}{6}$; etc. 

In the above equations, if T is replaced by $t + \bar{T} = t + \frac{N + 1}{2}$ 
and if, by definition, $n = \frac{N + 1}{2}$, these equations become 

$$
\begin{aligned}
S_1 &= \sum y \\
S_2 &= n\sum y - \sum ty \\
2S_3 &= \sum (n - t)^2 y + n\sum y - \sum ty \\
6S_4 &= \sum (n - t)^3 y + 3\sum (n - t)^2 y + 2n\sum y - 2\sum ty
\end{aligned}
$$

in which the unmarked $\sum$ refers to summations with respect to 
t from $+\frac{N - 1}{2}$ to $-\frac{N - 1}{2}$. If these equations are expanded 
and similar terms assembled, Eqs. (11) are obtained. 

Table 86. Work Sheet for Calculating Trend: Second-degree 
Polynomial 
Method of least squares 

Data: Consumer expenditures for personal appearance and comfort. 
Annual data in millions of dollars 

Source of Data: Survey of Current Business, October, 1942, p. 24. 

                        Sets of subtotals
Year    t      y        First     Second
1929   -6     655         655        655
1930   -5     630       1,285      1,940
1931   -4     540       1,825      3,765
1932   -3     427       2,252      6,017
1933   -2     347       2,599      8,616
1934   -1     393       2,992     11,608
1935    0     441       3,433     15,041
1936    1     503       3,936     18,977
1937    2     545       4,481     23,458
1938    3     543       5,024     28,482
1939    4     540       5,564     34,046
1940    5     568       6,132     40,178
1941    6     653       6,785     46,963
Total       6,785      46,963    239,746
               S1          S2         S3

By using Eqs. (11), page 584, the following values are obtained: 

$$
\begin{aligned}
\sum y &= 6{,}785 \\
\sum ty &= 7(6{,}785) - 46{,}963 = 532 \\
\sum t^2 y &= 49(6{,}785) - 15(46{,}963) + 2(239{,}746) = 107{,}512
\end{aligned}
$$

By using Eqs. (10) the following values are obtained: 

$$
\begin{aligned}
\sum t^2 &= \frac{7(6)(13)}{3} = 182 \\
\sum t^4 &= 182 \left[ \frac{147 - 21 - 1}{5} \right] = 182(25) = 4{,}550
\end{aligned}
$$

To find the second-degree polynomial trend equation these values 
may be substituted in Eqs. (7), (i) to (iii), page 570, as follows: 

(i)     6,785 = 13a + 182c 
(ii)    532 = 182b                        b = 2.923 
(iii)   107,512 = 182a + 4,550c 

(iii)   107,512 = 182a + 4,550c 
(i')    94,990 = 182a + 2,548c            (i) × 14 
        12,522 = 2,002c                   c = 6.2547 

(iii)   107,512 = 182a + 4,550c 
(i'')   169,625 = 325a + 4,550c           (i) × 25 
        62,113 = 143a                     a = 434 

Accordingly, the second-degree polynomial equation of trend 
for the problem illustrated is $y' = 434 + 2.923t + 6.2547t^2$ 
(origin at 1935). 
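
Because the sums of the odd powers of t vanish, the three normal equations separate: (ii) gives b at once, and (i) and (iii) form a pair in a and c. A minimal sketch of that elimination follows; the function name and argument names are illustrative, and the figures are those obtained above from the work sheet.

```python
def second_degree_trend(N, sum_t2, sum_t4, sum_y, sum_ty, sum_t2y):
    """Solve the normal equations of a second-degree trend centred on t = 0.

    (i)   sum_y   = N*a      + sum_t2*c
    (ii)  sum_ty  = sum_t2*b
    (iii) sum_t2y = sum_t2*a + sum_t4*c
    """
    b = sum_ty / sum_t2
    c = (N * sum_t2y - sum_t2 * sum_y) / (N * sum_t4 - sum_t2**2)
    a = (sum_y - c * sum_t2) / N
    return a, b, c


a, b, c = second_degree_trend(13, 182, 4550, 6785, 532, 107512)
print(round(a, 2), round(b, 3), round(c, 4))
```

Carried to two decimals the constant term comes out at about 434.36; the text carries it as 434.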

Finding Trend Values by Method of Finite Differences. The 
equation for a trend line having been found, the problem is to 
compute from this equation the values of y' pertaining to a given 
set of years. Direct substitution is laborious. Finite differences 
provide an easier method. The keystone of the latter 
method is the fact that the nth difference of a polynomial of the 
nth degree is constant. Hence, that constant nth difference 
having been determined, the other differences, and ultimately 
the desired trend values themselves, can all be computed by 
merely reversing the differencing process, i.e., by simple addition. 

In the equation $y' = a + bt + ct^2$ the first difference, by 
definition, would be 

$$
\Delta^1 y' = a + b(t + 1) + c(t + 1)^2 - a - bt - ct^2 = b + 2ct + c
$$

and, by definition, the second difference would be 

$$
\Delta^2 y' = b + 2c(t + 1) + c - b - 2ct - c = 2c
$$

Table 87. Building Up a Polynomial by Finite Differences 

(1)    (2)           (3)           (4)           (5)           (6)
       Fourth        Third         Second        First         Polynomial
       difference    difference    difference    difference    (trend) values
 t     Δ⁴y'          Δ³y'          Δ²y'          Δ¹y'          y'
-4       -1             6           -48           220            305
-3       -1             5           -42           172            525
-2       -1             4           -37           130            697
-1       -1             3           -33            93            827
 0       -1             2           -30            60            920
 1      ...             1           -28            30            980
 2      ...           ...           -27             2          1,010
 3      ...           ...           ...           -25          1,012
 4      ...           ...           ...           ...            987

Table 87 illustrates the building up of a polynomial by finite 
differences. The polynomial here is of the fourth degree, and 
hence its fourth differences are all identical. 

A figure in a given line of column (6) algebraically added to 
the figure in the same line of column (5) gives the next figure for 
column (6); thus, 305 + 220 = 525, 525 + 172 = 697, etc. 
Similarly, a figure in a given line of column (5) added algebraically 
to the figure in the same line of column (4) gives the next figure 
for column (5); thus, 220 - 48 = 172, 172 - 42 = 130, etc. 
The same general rule applies to the figures in columns (2) 
and (3); thus, -48 + 6 = -42, -42 + 5 = -37, etc., and 
6 - 1 = 5, 5 - 1 = 4, etc. 

In the polynomial illustrated in Table 87, the polynomial is 
known to have a constant fourth difference. Hence, if the 
polynomial value and the differences of any one line are all 
known, then the differences and polynomial values for all other 
lines above or below the given line can be readily computed. 
Thus, if for t = 0, it is known that the polynomial value y' = 920, 
$\Delta^1 y'_0 = 60$, $\Delta^2 y'_0 = -30$, $\Delta^3 y'_0 = 2$, and the constant fourth 
difference is equal to -1, then, by working from right to left and 
up and down, the other values in the table can be built up. 
The first set of variable differences, in this case the third, can be 
built up cumulatively from the known $\Delta^3 y'_0 = 2$ and the 
constant difference -1. It is to be noted that in a downward 
direction, in this table, this constant difference is -1; so in 
building up from the bottom to the top the constant difference is 
algebraically -(-1), or +1. This rule follows also for the 
building up of the other differences. 
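
The building-up rule can be written out as a short routine. This is a sketch under stated assumptions: the function name is illustrative, the last entry of `diffs0` is taken to be the constant difference, and the starting figures are those quoted above for t = 0 (920, 60, -30, 2, -1).

```python
def build_from_differences(value0, diffs0, t_min, t_max):
    """Build polynomial values from the value and differences at t = 0.

    diffs0 lists the first, second, ... differences at t = 0; the last
    one is constant. Returns a dict mapping each t to its value.
    """
    start = [value0] + list(diffs0)
    values = {0: value0}

    # Downward in the table (t increasing): each figure is the figure
    # above it plus the next-higher difference on the line above.
    state = start[:]
    for t in range(1, t_max + 1):
        state = [state[i] + state[i + 1] for i in range(len(state) - 1)] + [state[-1]]
        values[t] = state[0]

    # Upward in the table (t decreasing): subtract, beginning with the
    # highest variable difference and working toward the values.
    state = start[:]
    for t in range(-1, t_min - 1, -1):
        for i in range(len(state) - 2, -1, -1):
            state[i] -= state[i + 1]
        values[t] = state[0]
    return values


table = build_from_differences(920, [60, -30, 2, -1], -4, 4)
for t in sorted(table):
    print(t, table[t])
```

Only additions and subtractions are involved, which is the economy claimed for the method.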


Table 88. Aid for Computing Finite Differences at t = 0 in 
Polynomial $y' = a + bt + ct^2 + dt^3 + et^4 + ft^5 + \cdots$ 

Parameter    Δ¹y'₀    Δ²y'₀    Δ³y'₀    Δ⁴y'₀    Δ⁵y'₀
   (1)        (2)      (3)      (4)      (5)      (6)
    b          1
    c          1        2
    d          1        6        6
    e          1       14       36       24
    f          1       30      150      240      120

This method is of general validity and can be used to find 
values of a polynomial of any degree from knowledge of its 
value for one year and its differences for that same year. 
Fortunately, it is relatively easy to calculate the y' or polynomial 
value and the various differences for the year t = 0. If the 
form of the polynomial is $y' = a + bt + ct^2 + dt^3 + et^4 + \cdots$, 
then the polynomial value for t = 0 is y' = a. The first, second, 
and higher order differences for t = 0 can be computed with the 
help of Table 88. 

The figures in Table 88 give the weights by which the parameters 
b, c, d, e, f, etc., must be multiplied to give the difference 
specified at the top of each column, as follows: 

$$
\begin{aligned}
\Delta^1 y'_0 &= b + c + d + e + f + \cdots \\
\Delta^2 y'_0 &= 2c + 6d + 14e + 30f + \cdots \\
\Delta^3 y'_0 &= 6d + 36e + 150f + \cdots \\
\Delta^4 y'_0 &= 24e + 240f + \cdots \\
\Delta^5 y'_0 &= 120f + \cdots
\end{aligned}
$$

For a particular polynomial, each of these equations, of course, 
terminates with the coefficient of the highest power of t. Thus, for 
a second-degree polynomial $y' = a + bt + ct^2$ the formulas for the 
differences at t = 0 would be $\Delta^1 y'_0 = b + c$ and $\Delta^2 y'_0 = 2c$, the 
higher differences being zero since the second difference is the same 
for all values of t. For a third-degree polynomial, 

$$
y' = a + bt + ct^2 + dt^3
$$

the formulas would be $\Delta^1 y'_0 = b + c + d$, $\Delta^2 y'_0 = 2c + 6d$, and 
$\Delta^3 y'_0 = 6d$. For a fourth-degree polynomial 

$$
y' = a + bt + ct^2 + dt^3 + et^4
$$

the differences would be $\Delta^1 y'_0 = b + c + d + e$, 
$\Delta^2 y'_0 = 2c + 6d + 14e$, $\Delta^3 y'_0 = 6d + 36e$, and $\Delta^4 y'_0 = 24e$. 

For higher degree polynomials, the table can be readily 
extended by the rule that a figure in a given line of a given column 
is equal to the number of the column multiplied by the sum of 
the two figures in the line above situated in the given column 
and in the column to the left, respectively. For example, 
36 = 3(6 + 6); 24 = 4(0 + 6); etc.¹ 
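
The extension rule lends itself to a short routine. In this sketch the "number of the column" is taken to be the order k of the difference (1 for first differences, 2 for second, and so on), which is the reading consistent with the examples 36 = 3(6 + 6) and 24 = 4(0 + 6); the function names are illustrative. As a check, each weight is compared with the k-th difference of $t^p$ at t = 0 computed from the definition of a forward difference.

```python
from math import comb


def difference_weights(max_power):
    """rows[p][k] = weight of the coefficient of t**p in the k-th difference at t = 0."""
    rows = {1: {1: 1}}
    for p in range(2, max_power + 1):
        above = rows[p - 1]
        # k times (figure above in the same column + figure above, one column left)
        rows[p] = {k: k * (above.get(k, 0) + above.get(k - 1, 0))
                   for k in range(1, p + 1)}
    return rows


def kth_difference_of_power(p, k):
    """k-th forward difference of t**p at t = 0, from the definition."""
    return sum((-1) ** (k - j) * comb(k, j) * j**p for j in range(k + 1))


rows = difference_weights(6)
for p in sorted(rows):
    print(p, [rows[p][k] for k in sorted(rows[p])])
```

The lines for p = 1 to 5 correspond to the parameters b to f; the line for p = 6 is the next line obtained by the rule.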

The use of finite differences to compute the trend values of a 
second-degree polynomial is illustrated in Table 89. The trend 
is $y' = 434 + 2.923t + 6.2547t^2$ (origin at 1935), calculated 
above from Table 86 for data on consumer expenditures for 
personal appearance and comfort in the United States, 1929-1941. 

In Table 89, the constant second difference is known to be 
12.509; the first difference for t = 0 is 9.2,* and the trend value 
for t = 0 is y' = 434.0. These are first entered in the work 
sheet; then, since the constant second difference is positive, the 
remainder of the column of first differences is obtained by 
successively subtracting 12.509 to obtain first differences for earlier 
years and by successively adding 12.509 to obtain first differences 
for later years. Obtaining the trend values is illustrated as 
follows: 434.0 + 9.2 = 443.2, 443.2 + 21.7 = 464.9, etc.; for 
values before 1935, 434.0 - (-3.3) = 437.3, 437.3 - (-15.8) 
= 453.1, etc. Beginning at the top of the table, it will then 
be found that 641.5 - 65.9 = 575.6, 575.6 - 53.4 = 522.2, etc. 

Table 89. Work Sheet for Computing Trend Values by Method of 
Finite Differences 

Equation of trend: $y' = 434 + 2.923t + 6.2547t^2$ 
Value of $y'_0$ = 434 
Value of $\Delta^1 y'_0$ = 2.923 + 6.2547 = 9.1777 
Constant $\Delta^2 y'$ = 12.509 

Year     t     Δ²y'      Δ¹y'      y'
1929    -6              -65.9     641.5
1930    -5              -53.4     575.6
1931    -4              -40.8     522.2
1932    -3              -28.3     481.4
1933    -2              -15.8     453.1
1934    -1               -3.3     437.3
1935     0    12.509      9.2     434.0
1936     1               21.7     443.2
1937     2               34.2     464.9
1938     3               46.7     499.1
1939     4               59.2     545.8
1940     5               71.7     605.0
1941     6                        676.7

¹ Cf. Whittaker and Robinson, op. cit., pp. 1-7. 

* The first differences may be rounded without causing cumulative error. 
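
As a cross check, the same trend can be built by differences without rounding and compared with direct substitution. This is an illustrative sketch; the function name is an assumption. Because the work sheet carries its first differences to one decimal only, the unrounded figures can be expected to differ from the tabled ones in the first decimal place (by less than 0.2 over this range).

```python
def quadratic_trend_by_differences(a, b, c, t_min, t_max):
    """Trend values of y' = a + b*t + c*t**2 built up by finite differences."""
    first0 = b + c      # first difference at t = 0
    second = 2 * c      # constant second difference
    values = {0: a}

    y, first = a, first0
    for t in range(1, t_max + 1):        # later years: add
        y += first
        first += second
        values[t] = y

    y, first = a, first0
    for t in range(-1, t_min - 1, -1):   # earlier years: subtract
        first -= second
        y -= first
        values[t] = y
    return values


a, b, c = 434.0, 2.923, 6.2547
by_differences = quadratic_trend_by_differences(a, b, c, -6, 6)
for t in range(-6, 7):
    direct = a + b * t + c * t * t
    print(1935 + t, round(by_differences[t], 1), round(direct, 1))
```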

While the explanation of the method of finite differences may 
be extended, its use in solving second-degree polynomials for 
various values of t is much more expeditious than the method of 
obtaining solutions to the equation for the various values of t by 
substitution in the equation. The labor involved in the longer 
method is great if the number of years is large or if the poly¬ 
nomial is of higher than a second degree. In contrast, the 
method of finite differences may be used without difficulty, and 
the arithmetic involved is always simple addition or subtraction. 

A danger inheres in the use of finite differences, namely, that 
any error in the higher order differences is cumulated as the lower 
order differences are computed. For this reason, when the 
trend line is determined the coefficients of the higher powers of 
t should be carried out to a larger number of places than would 
be regarded as significant. If, for example, the coefficient of $t^4$ 
is rounded off to the fifth place, the maximum error in the fourth 
difference is 24 × 0.000005 = 0.00012, over a 7-year period. 
If the other coefficients have also been rounded off to the last 
place indicated, then the maximum error in 

$$
\Delta^3 y'_0 = 6(0.00005) + 36(0.000005) = 0.00048
$$

in 

$$
\Delta^2 y'_0 = 2(0.0005) + 6(0.00005) + 14(0.000005) = 0.00137
$$

in 

$$
\Delta^1 y'_0 = 0.005 + 0.0005 + 0.00005 + 0.000005 = 0.0055555
$$

and in $y'_0$ = 0.05. Thus, by the time $y'_6$, the seventh year, for 
example, has been computed, the maximum error in that figure 
becomes 0.11+. This is shown in Table 90. 

Table 90. Maximum Cumulated Errors in Differences and 
Polynomial Values 
Error in $\Delta^4 y'$ = 0.000120 

 t      Δ³y'        Δ²y'        Δ¹y'        y'
 0    0.000480    0.001370    0.005555    0.050000
 1    0.000600    0.001850    0.006925    0.055555
 2    0.000720    0.002450    0.008775    0.062480
 3    0.000840    0.003170    0.011225    0.071255
 4                0.004100    0.014395    0.082480
 5                            0.018495    0.096875
 6                                        0.115370

The final error grows larger the further the work proceeds and 
thus makes it necessary to compute the coefficients of higher 
powers of t to several figures beyond the number of significant 
figures required in the computed trend values. A cross check 
on the method of finite differences would be to solve the 
polynomial equation for the terminal values of t. 

The danger of cumulative error is reduced to a minimum by 
starting at t = 0 and accumulating upward through the -t's 
and accumulating downward through the +t's. 

Analysis of Cycles by Empirical Trends. Data on plate-glass 
production in the United States, 1933-1941, have been 
selected, in order to illustrate how cycles may be studied by 
empirical trend analysis. Table 91 is a work sheet providing the 
figures needed to compute either a straight-line trend or any 
polynomial trend up to the third degree. 

Table 91. Work Sheet for Computing Trend and Index of Normal 
Method of least squares 

Data: Production of plate glass, polished, in the United States 
(In millions of square feet, monthly) 

Source: Survey of Current Business, Supplement, Vol. 20 (1940), p. 151; 
Vol. 21 (February, 1941), p. 99; Annual (March, 1942), p. S-35. 

                        Sets of subtotals
Year    t      y       First    Second     Third
1933   -4     7.2        7.2       7.2       7.2
1934   -3     7.9       15.1      22.3      29.5
1935   -2    15.0       30.1      52.4      81.9
1936   -1    16.5       46.6      99.0     180.9
1937    0    16.0       62.6     161.6     342.5
1938    1     7.1       69.7     231.3     573.8
1939    2    11.8       81.5     312.8     886.6
1940    3    13.7       95.2     408.0   1,294.6
1941    4    15.9      111.1     519.1   1,813.7
Total       111.1      519.1   1,813.7   5,210.7
               S1         S2        S3        S4

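By using Eqs. (10), the values of $\sum t^2$, $\sum t^4$, and $\sum t^6$ are found 
as follows: 

$$
\begin{aligned}
\sum t^2 &= \frac{n(n - 1)(2n - 1)}{3} = \frac{5(4)(9)}{3} = 60 \\
\sum t^4 &= \sum t^2 \left[ \frac{3n^2 - 3n - 1}{5} \right] = 60 \left[ \frac{75 - 15 - 1}{5} \right] = 708 \\
\sum t^6 &= \sum t^2 \left[ \frac{3n^2(n^2 - 2n) + 3n + 1}{7} \right] = 60 \left[ \frac{75(25 - 10) + 15 + 1}{7} \right] = 60(163) = 9{,}780
\end{aligned}
$$

By using Eqs. (11), the values of $\sum y$, $\sum ty$, $\sum t^2 y$, and $\sum t^3 y$ are found 
as follows: 

$$
\begin{aligned}
\sum y &= S_1 = 111.1 \\
\sum ty &= nS_1 - S_2 = 5(111.1) - 519.1 = 36.4 \\
\sum t^2 y &= n^2 S_1 - (2n + 1)S_2 + 2S_3 \\
&= 25(111.1) - 11(519.1) + 2(1{,}813.7) \\
&= 694.8 \\
\sum t^3 y &= n^3 S_1 - (3n^2 + 3n + 1)S_2 + 6(n + 1)S_3 - 6S_4 \\
&= 125(111.1) - 91(519.1) + 36(1{,}813.7) - 6(5{,}210.7) \\
&= 678.4
\end{aligned}
$$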

From these, two trend lines may be computed, first a straight 
line, and second a third-degree polynomial, as follows: 

Straight-line trend: 

$$
y'_1 = \frac{111.1}{9} + \frac{36.4}{60}\, t
$$

$$
y'_1 = 12.3 + 0.607t \qquad \text{(origin at 1937)}
$$

Third-degree polynomial trend: 

The normal equations are 

$$
\begin{aligned}
\sum y &= Na + b\sum t + c\sum t^2 + d\sum t^3 \\
\sum ty &= a\sum t + b\sum t^2 + c\sum t^3 + d\sum t^4 \\
\sum t^2 y &= a\sum t^2 + b\sum t^3 + c\sum t^4 + d\sum t^5 \\
\sum t^3 y &= a\sum t^3 + b\sum t^4 + c\sum t^5 + d\sum t^6
\end{aligned}
$$

in which all the sums of the odd powers of t are equal to zero, 
so that the equations for finding a, b, c, d are as follows: 

(i)           111.1 = 9a + 60c 
(ii)          36.4 = 60b + 708d 
(iii)         694.8 = 60a + 708c 
(iv)          678.4 = 708b + 9,780d 

(ii')         429.52 = 708b + 8,354.4d        (ii) × 11.8 
(iv) - (ii')  248.88 = 1,425.6d               d = 0.17457 

(iii)         694.8 = 60a + 708c 
(i')          1,310.98 = 106.2a + 708c        (i) × 11.8 
(i') - (iii)  616.18 = 46.2a                  a = 13.34 

Substituting d in Eq. (ii), b = -1.45325 

Substituting a in Eq. (i), c = -0.14891 

The third-degree polynomial trend equation is thus 

$$
y'_2 = 13.34 - 1.45325t - 0.14891t^2 + 0.17457t^3 \qquad \text{(origin at 1937)}
$$
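
With the odd-power sums equal to zero, equations (i) and (iii) involve only a and c, and equations (ii) and (iv) involve only b and d, so the cubic is found from two independent pairs. A minimal sketch of that solution follows, using Cramer's rule in place of the multiplication by 11.8 used above; the function names are illustrative. Small differences from the text in the later decimals arise from the rounding of intermediate figures in the hand computation.

```python
def solve_pair(a11, a12, r1, a21, a22, r2):
    """Solve a11*x + a12*y = r1 and a21*x + a22*y = r2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (r1 * a22 - a12 * r2) / det, (a11 * r2 - a21 * r1) / det


def third_degree_trend(N, st2, st4, st6, sy, sty, st2y, st3y):
    """Coefficients a, b, c, d of a cubic trend centred on t = 0."""
    a, c = solve_pair(N, st2, sy, st2, st4, st2y)      # equations (i) and (iii)
    b, d = solve_pair(st2, st4, sty, st4, st6, st3y)   # equations (ii) and (iv)
    return a, b, c, d


a, b, c, d = third_degree_trend(9, 60, 708, 9780, 111.1, 36.4, 694.8, 678.4)
print(round(a, 2), round(b, 4), round(c, 4), round(d, 4))
```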

By using the method of finite differences to solve for various 
trend values, from Table 88 above, at t = 0, 

$$
y'_2 = 13.34
$$

and 

$$
\begin{aligned}
\Delta^1 y'_2 &= -1.45325 - 0.14891 + 0.17457 = -1.42759 \\
\Delta^2 y'_2 &= 2(-0.14891) + 6(0.17457) = 0.7496 \\
\Delta^3 y'_2 &= 6(0.17457) = 1.04742
\end{aligned}
$$

the last of which is a constant difference in this case. 

In Table 92, trend values are built up for the problem by using 
the method of finite differences. First, opposite t = 0, the 
value of y' and the first, second, and third differences are entered. 
The constant third difference, 1.04742, is then subtracted 
successively in the -t direction (upward in the table); it is then 
added successively in the +t direction (downward in the table). 
For example, starting at t = 0, the second difference is 0.74960; 
the second difference at t = -1 is 

0.74960 - 1.04742 = -0.29782 

the second difference at t = -2 is 

-0.29782 - 1.04742 = -1.34524; etc. 

Starting again at t = 0, the second difference is again 0.74960; 
the second difference at t = +1 is 

0.74960 + 1.04742 = 1.79702; etc. 

The column of first differences is built up from the column of 
second differences. For example, starting at t = 0, the first 
difference is -1.42579; the first difference for t = -1 is then 
-1.42579 - (-0.29782) = -1.12797, the first difference for 
t = -2 is -1.12797 - (-1.34524) = +0.21727; etc. Again, 
starting at t = 0 with the first difference -1.42579, the first 
difference at t = +1 is -1.42579 + 0.74960 = -0.67619; the 
first difference at t = +2 is -0.67619 + 1.79702 = 1.12083; etc. 
The values of $y'_2$ are found from the first differences in exactly the 
same manner as the first differences from the second differences. 

Fig. 146. Production of plate glass, polished, in the United States, 1933-1941. 
Straight-line and third-degree polynomial trends shown with raw data. 

Table 92. Work Sheet for Finding Trend Values by Method of 
Finite Differences 

Equation of trend: $y'_2 = 13.34 - 1.45325t - 0.14891t^2 + 0.17457t^3$ 
(origin at 1937) 

Year     t     Δ³y'₂       Δ²y'₂       Δ¹y'₂      y'₂
1933    -4               -3.44008     6.05001     5.59
1934    -3               -2.39266     2.60993    11.64
1935    -2               -1.34524     0.21727    14.25
1936    -1               -0.29782    -1.12797    14.47
1937     0    1.04742     0.74960    -1.42579    13.34
1938     1                1.79702    -0.67619    11.91
1939     2                2.84444     1.12083    11.24
1940     3                            3.96527    12.36
1941     4                                       16.32

The results of the trend analysis are shown graphically in 
Fig. 146. If it can be assumed that the period of 9 years covered 
by the data is a segment in a longer cyclical movement, 
the straight-line trend may be considered to measure a part of 
that longer cycle, namely, part or all of its upward movement. The 
shorter cycle is then shown by the polynomial trend. Plate-glass 
production appears to have gone through one complete 
short cycle from about 1934 to about mid-1940. 



CHAPTER XXII 


ORTHOGONAL-POLYNOMIAL TRENDS 

Great economy in trend analysis is secured by the use of 
orthogonal polynomials, especially if the trend desired is of 
higher degree than a second-degree polynomial. It requires 
considerable space to explain and describe the method of orthogonal 
polynomials, which may seem to belie the fact of its economy in 
use, but the actual arithmetic of application is simple. When 
lines of regression involving more than three coefficients are 
fitted to time series by the least-squares criterion, the work of 
computation by the ordinary method increases very rapidly. 
Laborsaving devices introduced in the preceding chapter, including 
the use of the summation work sheet and the determination of 
$\sum t^2$, $\sum t^4$, etc., by formula, help to keep the amount of 
calculation at a minimum; but further reduction in the amount of 
calculation and particularly in the magnitude of the figures that 
have to be handled is obtained by using orthogonal polynomials. 

A "polynomial" is an algebraic expression of the form 

$$
a + bt + ct^2
$$

which, for example, is a polynomial in t of the second degree. A 
polynomial in t of the fourth degree would be 

$$
a + bt + ct^2 + dt^3 + et^4
$$

and so forth. "Orthogonal" polynomials are polynomials that 
bear a certain relationship to each other, to be described below. 
The use of orthogonal polynomials involves merely a special 
method of computing the coefficients of a trend line; the method 
of fitting is still the method of least squares. 

One of the greatest advantages of using orthogonal-polynomial 
trends is that, if the investigator decides to fit either a higher or 
lower degree trend line than what he has already derived, the 
amount of work involved in these further calculations is reduced 
to a minimum. In fact, no extra work at all would be required 
to determine the equation for a trend line of lower degree, while 
the determination of an equation for a trend line of higher degree 
would require only the calculation of quantities pertaining 
directly to the added term and would not necessitate any 
recalculations of other quantities. The work already done will 
therefore not be wasted. 

Orthogonal Polynomials. Suppose a variable t has a set of 
values, say from 0 to 3. If each of these values is substituted in 
a polynomial in t, the polynomial will take on a corresponding 
set of values. Thus, if $p_1 = t - 1.5$ is a given polynomial in t, 
then, as t has the values 0, 1, 2, and 3, $p_1$ has values -1.5, 
-0.5, +0.5, and +1.5. Another polynomial in t, say 

$$
p_2 = t^2 - 3t + 1
$$

will have a different set of values; in this instance, it will have 
the values 1, -1, -1, and 1 when t has the values 0, 1, 2, and 3, 
respectively. 

Orthogonal polynomials are those that bear special relationships 
to each other. The necessary condition for two polynomials 
to be orthogonal to each other is that the sum of their 
products for all values of t shall be equal to zero. That this 
necessary condition is met by $p_1 = t - 1.5$ and $p_2 = t^2 - 3t + 1$ 
is readily seen. Thus, when t = 0, 

$$
p_1 p_2 = (t - 1.5)(t^2 - 3t + 1) = -1.5
$$

when t = 1, $p_1 p_2 = +0.5$; when t = 2, $p_1 p_2 = -0.5$; and 
when t = 3, $p_1 p_2 = +1.5$. Hence, 

$$
\sum p_1 p_2 = -1.5 + 0.5 - 0.5 + 1.5 = 0
$$

The polynomials $p_1 = t - 1.5$ and $p_2 = t^2 - 3t + 1$, accordingly, 
possess the orthogonal property. 

In general, if a set of polynomials in t, say $p_1, p_2, p_3, \ldots, p_r$, 
form an orthogonal set, then it is necessary that 

$$
\begin{aligned}
\sum p_1 p_2 &= 0 & \sum p_1 p_3 &= 0 & \cdots & & \sum p_1 p_r &= 0 \\
\sum p_2 p_3 &= 0 & \sum p_2 p_4 &= 0 & \cdots & & \sum p_2 p_r &= 0 \\
\sum p_3 p_4 &= 0 & \sum p_3 p_5 &= 0 & \cdots & & \sum p_3 p_r &= 0
\end{aligned}
\qquad (1)
$$

These are the general conditions that must be satisfied by 
orthogonal polynomials. Notice that they are equivalent to the 
conditions that the correlation between each pair of polynomials 
is zero. 
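
The orthogonality of the two illustrative polynomials can be verified by direct tabulation. The sketch below uses only the two polynomials and the four values of t given in the text; the function names are illustrative.

```python
def p1(t):
    return t - 1.5


def p2(t):
    return t * t - 3 * t + 1


ts = [0, 1, 2, 3]
products = [p1(t) * p2(t) for t in ts]

print([p1(t) for t in ts])   # values of p1
print([p2(t) for t in ts])   # values of p2
print(products, sum(products))
```

Each polynomial also sums to zero over these values of t, which is the condition of being orthogonal to unity that is used in the fitting of trend lines.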

Trend Line in Orthogonal Polynomials by the Method of Least 
Squares. The form of a trend-line equation that has so far been 
used is $y' = a + bt + ct^2 + dt^3 + \cdots$. This is an arbitrary 
form, however, and it is to be noted that other forms of the 
identical equation are possible. This can be illustrated 
numerically as follows: 

The equation $y' = 105.3 + 8.1t - 0.7t^2$ is identically the same 
as $y' = 115 + 6(t - 1.5) - 0.7(t^2 - 3t + 1)$, which may be 
proved by multiplying out the expressions in the latter equation 
and collecting like terms. If the use of the second form has any 
advantage over the use of the first, there is no reason why it 
may not be adopted. 

Suppose, now, that instead of fitting a trend line in the form 
$y' = a + bt + ct^2 + dt^3 + et^4$, the fitting process is carried out with 
respect to the form 

$$
y' = A + Bp_1 + Cp_2 + Dp_3 + Ep_4
$$

in which $p_1$, $p_2$, $p_3$, and $p_4$ are polynomials in t of the first, second, 
third, and fourth degree, respectively, that are orthogonal to 
each other and to unity, that is to say, where $p_1$ is a polynomial 
in t of the form $p_1 = k_{10} + t$, $p_2$ is a polynomial in t of the form 
$p_2 = k_{20} + k_{21}t + t^2$, $p_3$ is a polynomial in t of the form 
$p_3 = k_{30} + k_{31}t + k_{32}t^2 + t^3$, etc., 
and where $\sum p_1 = 0$, $\sum p_2 = 0$, $\sum p_3 = 0$, $\sum p_4 = 0$, and $\sum p_1 p_2 = 0$, 
$\sum p_1 p_3 = 0$, $\sum p_1 p_4 = 0$, $\sum p_2 p_3 = 0$, etc. With reference to the 
arithmetical illustration given above, which was a second-degree 
polynomial, this is equivalent to deriving a trend line of the form 

$$
y' = 115 + 6(t - 1.5) - 0.7(t^2 - 3t + 1)
$$

instead of the usual form $y' = 105.3 + 8.1t - 0.7t^2$. 

Either method will, of course, give the same result; for, 
whichever form is derived, it can be converted into the other by 
simple algebra. It is the purpose of this section to show the 
simplification gained by using the orthogonal-polynomial form 
rather than the usual form. The problem of finding the forms 
of the polynomials themselves, i.e., the values of the * coefiicients, 
will be left for a subsequent section. 
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If a trend line is put in the orthogonal-polynomial form 

y' = A + Bpi + Cpi + Dpi 

and is then fitted by the method of least squares, i.e., if A, B, C, 
and D are determined so that 

2 ( 1 / - y'Y = S(j/ - A - Bpi - Cp2 - DpsY 

is made a minimum, the following conditions are obtained: 

2(y — A — Bpi — Cpi — Dps) = 0 

Spi(a — A — Bpi — Cpi — Dps) = 0 

2^2(2/ — A — Bpi — Cps — Dps) = 0 

Y.ps{y — A — Bpi — Cps — Dps) = 0 

or 

^y = NA+ B2pi + CSps + D2ps 

2piy = + B2pl -b C2pipt -t- DXpips 

2psy = A2pi -f- B2pips -b C2p\ -b D2psps 
2psy = A2ps + B2pips -b CXpsps + D2pl 


But since 1, pi, ps, and ps form an orthogonal set (by assump¬ 
tion), it follows that 2pi = 0, 2 p 2 = 0, 2 p 3 = 0, 2 pip 2 = 0, 
2 pip 3 = 0, and 2 p 2 Ps = 0. * Hence the above equations reduce 
to 


and 

and therefore 


2y = NA 

2piy = B2p? 

Xpsy = C2p\ 

2psy = I>2p| 



B = 


2piy 

2 p! 


„ _ 2psy 
^ 2pi 




The simple form of these solutions will be noted. It will also
be noted that the solution for $A$ is independent of $p_1$, $p_2$,
and $p_3$ and that the solution for $B$ depends only upon $p_1$, the
solution for $C$ only upon $p_2$, and the solution for $D$ only upon
$p_3$. This means that the value of $A$ would have been the same
whether a first-, second-, or third-degree trend line had been
fitted. Similarly, the value of $B$ would have been the same whether
a first-, second-, or third-degree trend had been fitted, and the
value of $C$ would have been the same whether a second- or
third-degree trend line had been fitted. For if $y' = A + Bp_1$ had
been fitted, the solutions would still have been

\[
A = \frac{\sum y}{N} \quad\text{and}\quad B = \frac{\sum p_1y}{\sum p_1^2}
\]

If $y' = A + Bp_1 + Cp_2$ had been fitted, the solutions would
still have been

\[
A = \frac{\sum y}{N}, \quad B = \frac{\sum p_1y}{\sum p_1^2},
\quad\text{and}\quad C = \frac{\sum p_2y}{\sum p_2^2}
\]

The addition of the term $Cp_2$ does not therefore change the
values obtained for $A$ or $B$, and the addition of the term $Dp_3$
does not change the values obtained for $A$, $B$, or $C$. It also
can be seen that if a fifth term were added to the trend line,
namely, $Ep_4$, making it a fourth-degree trend, the value of $E$
would be given by $E = \sum p_4y / \sum p_4^2$ and the values of
$A$, $B$, $C$, and $D$ would be the same as before. It is this
simplicity and independence of the solutions of the least-squares
equations when orthogonal polynomials are used that give the
orthogonal-polynomial method its main advantage over the ordinary
method.
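
The independence of the solutions can be checked numerically. The following is a minimal sketch in Python and is not part of the original text: the seven-year series `y` is invented for illustration, and the helper names (`centered_times`, `orthogonal_values`, `solve`, `normal_equation_fit`) are hypothetical. The polynomial values are tabulated here by Gram-Schmidt orthogonalization of 1, t, t^2, t^3, an assumption made for convenience rather than the closed forms derived in the next section. The full normal equations are solved exactly and compared with the simple ratios given above.

```python
from fractions import Fraction

def centered_times(n_years):
    """t = T - mean(T) for n_years equally spaced periods."""
    middle = Fraction(n_years - 1, 2)
    return [Fraction(i) - middle for i in range(n_years)]

def dot(u, v):
    """Sum of products of two equally long columns."""
    return sum(a * b for a, b in zip(u, v))

def orthogonal_values(t_values, max_degree=3):
    """Values of 1, p1, p2, ... obtained by Gram-Schmidt on 1, t, t^2, ..."""
    basis = [[Fraction(1)] * len(t_values)]
    for degree in range(1, max_degree + 1):
        column = [t ** degree for t in t_values]
        for earlier in basis:
            weight = dot(column, earlier) / dot(earlier, earlier)
            column = [c - weight * e for c, e in zip(column, earlier)]
        basis.append(column)
    return basis

def solve(matrix, rhs):
    """Exact Gauss-Jordan elimination for a small linear system."""
    rows = [list(row) + [value] for row, value in zip(matrix, rhs)]
    size = len(rows)
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        for r in range(size):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * p for x, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

def normal_equation_fit(columns, y_values):
    """Least-squares coefficients from the full set of normal equations."""
    matrix = [[dot(ci, cj) for cj in columns] for ci in columns]
    rhs = [dot(ci, y_values) for ci in columns]
    return solve(matrix, rhs)

y = [Fraction(v) for v in (12, 15, 14, 19, 23, 22, 30)]  # invented data
t = centered_times(len(y))
orthogonal = orthogonal_values(t)                 # columns 1, p1, p2, p3
powers = [[x ** k for x in t] for k in range(4)]  # columns 1, t, t^2, t^3

ratios = [dot(p, y) / dot(p, p) for p in orthogonal]    # A, B, C, D as ratios
cubic_orthogonal = normal_equation_fit(orthogonal, y)
line_orthogonal = normal_equation_fit(orthogonal[:2], y)
cubic_ordinary = normal_equation_fit(powers, y)
line_ordinary = normal_equation_fit(powers[:2], y)

print("A, B, C, D:", [str(c) for c in ratios])
print("b in ordinary form, line vs cubic:", line_ordinary[1], cubic_ordinary[1])
```

In this sketch the coefficients of the orthogonal form are unchanged when terms are added, whereas the coefficient of t in the ordinary form differs between the straight-line and the third-degree fit.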

Forms of Orthogonal Polynomials Used. The forms of the
orthogonal polynomials to be used for fitting trends can be
generalized; what is required is to find the $k$'s in terms of the
given values of $t$ and the number of years involved. The condition
has been laid down that $p_1 = k_{10} + t$,
$p_2 = k_{20} + k_{21}t + t^2$, and
$p_3 = k_{30} + k_{31}t + k_{32}t^2 + t^3$, etc., are to be
polynomials of the first, second, and third degree in $t$,
respectively, that are orthogonal to each other and to unity. The
problem is to make use of this condition to determine values for the
$k$'s in terms of the given values of $t$. When this is done, it will
be possible to find the actual values of $A$, $B$, $C$, and $D$, from
the formulas of the preceding section.
By way of illustration, the forms of only $p_1$ and $p_2$ will be
derived; the method can be readily extended to the determination
of the forms of $p_3$ and of higher polynomials.

First, it is assumed that the time intervals $T$ are measured
from the mean $\bar{T}$, so that $p_1$ and $p_2$ become
$p_1 = k_{10} + t$ and $p_2 = k_{20} + k_{21}t + t^2$, where
$t = T - \bar{T}$. In addition, it is supposed that the time
intervals to which the variable refers are equally spaced and
without interruptions. According to these assumptions, $t$ will have
a mean of zero; its highest value will be $+\frac{N-1}{2}$ and its
lowest value $-\frac{N-1}{2}$.* For example, if there are 5 years of
data, the middle year will be 0, the first year $-2$, and the last
year $+2$. If there are 4 years of data, the first year will be
$-\frac{3}{2}$, the second year $-\frac{1}{2}$, the third year
$+\frac{1}{2}$, and the last $+\frac{3}{2}$.

Accordingly, all the odd moments of $t$, such as $\sum t/N$,
$\sum t^3/N$, and $\sum t^5/N$, will be zero; the even moments, such
as $\sum t^2/N$, $\sum t^4/N$, and $\sum t^6/N$, are computable from
simple formulas depending entirely on $N$, the number of years, as
already noted in Chap. XXI.¹

With these assumptions, the derivation of the form of the
orthogonal polynomials, that is to say, the derivation of the
values of $k_{10}$, $k_{20}$, and $k_{21}$, may now be undertaken.
The condition that $p_1$, $p_2$, and 1 shall be orthogonal to each
other requires that $\sum p_1 = 0$, $\sum p_2 = 0$, and
$\sum p_1p_2 = 0$. These equations may be written as follows:

\[ \sum p_1 = \sum (k_{10} + t) = Nk_{10} + \sum t = 0 \tag{i} \]

\[ \sum p_2 = \sum (k_{20} + k_{21}t + t^2) = Nk_{20} + k_{21}\sum t + \sum t^2 = 0 \tag{ii} \]

\[ \sum p_1p_2 = \sum p_2(k_{10} + t) = k_{10}\sum p_2 + \sum p_2t = 0 \tag{iii} \]

From these equations, the values of the $k$'s can readily be
obtained. Since $\sum t = 0$, (i) gives $Nk_{10} = 0$, or
$k_{10} = 0$; and Eq. (ii) gives $Nk_{20} + \sum t^2 = 0$, or
$k_{20} = -\frac{\sum t^2}{N}$. From Eq. (ii), it is known that
$\sum p_2 = 0$; hence, Eq. (iii) becomes $\sum p_2t = 0$.
Substituting the equivalent of $p_2$, this gives the condition,

\[ \sum p_2t = \sum (k_{20} + k_{21}t + t^2)t = k_{20}\sum t + k_{21}\sum t^2 + \sum t^3 = 0 \tag{iv} \]

* These assumptions were made in the preceding chapter. Cf. Tables
84 to 86, Chap. XXI.

1 Cf. pp. 684, 586.
Since both $\sum t$ and $\sum t^3$ are equal to zero, this becomes

\[ k_{21}\sum t^2 = 0 \]

and hence $k_{21}$ must be zero, since $\sum t^2$ is not. The values
of the $k$'s, therefore, are as follows:

\[ k_{10} = 0, \qquad k_{21} = 0, \qquad k_{20} = -\frac{\sum t^2}{N} \]

and the forms of the polynomials $p_1$, $p_2$ are therefore

\[ p_1 = t, \qquad p_2 = t^2 - \frac{\sum t^2}{N} \tag{3} \]

for

\[ \sum t^2 = \frac{n(n-1)(2n-1)}{3} \]

in which

\[ n = \frac{N+1}{2}* \]

Hence,

\[ \frac{\sum t^2}{N} = \frac{N^2 - 1}{12} \]

Accordingly,

\[ p_2 = t^2 - \frac{N^2 - 1}{12} \tag{4} \]

Similar methods of analysis may be used to derive the forms
of $p_3$ and higher polynomials. The results obtained for
polynomials up to the fifth degree may be listed as follows:†

\[
\begin{aligned}
p_1 &= t\\
p_2 &= t^2 - \frac{N^2-1}{12}\\
p_3 &= t^3 - \frac{3N^2-7}{20}\,t\\
p_4 &= t^4 - \frac{3N^2-13}{14}\,t^2 + \frac{3(N^2-1)(N^2-9)}{560}\\
p_5 &= t^5 - \frac{5(N^2-7)}{18}\,t^3 + \frac{15N^4 - 230N^2 + 407}{1{,}008}\,t
\end{aligned}
\tag{5}
\]

* Cf. Eqs. (10), Chap. XXI.

† Cf. Fisher, R. A., Statistical Methods for Research Workers, Section 27.
Thus, it is to be noted that a trend line can be fitted in two
different forms, by the method of least squares; it can be fitted
in the form $y' = a + bt + ct^2 + dt^3 + \ldots$ (where
$t = T - \bar{T}$) by the methods described in the preceding
chapter, or it can be fitted in the form
$y' = A + Bp_1 + Cp_2 + Dp_3 + \ldots$ by the method of orthogonal
polynomials. If the orthogonal-polynomial form is used in the
fitting process, the ordinary form of the trend equation can readily
be derived from the results; it should be repeated that the
criterion of fit in each case is the least-squares criterion.
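
As an illustrative check on Eqs. (5), the sketch below (Python, not part of the original text) evaluates the five closed forms for several values of N, including even values, and confirms that the resulting columns are orthogonal to each other and to unity. It also shows how the ordinary form of a second-degree trend follows from the orthogonal form. The function names (`centered_times`, `eq5_polynomials`, `is_orthogonal_set`, `ordinary_quadratic`) are hypothetical, and exact fractions are used so that the zero sums are exact.

```python
from fractions import Fraction
from itertools import combinations

def centered_times(n_years):
    """t = T - mean(T) for n_years equally spaced periods."""
    middle = Fraction(n_years - 1, 2)
    return [Fraction(i) - middle for i in range(n_years)]

def eq5_polynomials(t, n_years):
    """p1 ... p5 at time t, from the closed forms of Eqs. (5)."""
    n2 = Fraction(n_years) ** 2
    return [
        t,
        t**2 - (n2 - 1) / 12,
        t**3 - (3 * n2 - 7) / 20 * t,
        t**4 - (3 * n2 - 13) / 14 * t**2 + 3 * (n2 - 1) * (n2 - 9) / 560,
        t**5 - 5 * (n2 - 7) / 18 * t**3
        + (15 * n2**2 - 230 * n2 + 407) / 1008 * t,
    ]

def is_orthogonal_set(n_years):
    """True if 1, p1, ..., p5 are mutually orthogonal over the n_years periods."""
    times = centered_times(n_years)
    columns = [[Fraction(1)] * n_years]
    columns += [list(col) for col in
                zip(*(eq5_polynomials(t, n_years) for t in times))]
    return all(sum(a * b for a, b in zip(u, v)) == 0
               for u, v in combinations(columns, 2))

def ordinary_quadratic(A, B, C, n_years):
    """a, b, c of y' = a + bt + ct^2 from A, B, C of y' = A + Bp1 + Cp2."""
    shift = Fraction(n_years ** 2 - 1, 12)   # the constant term of p2
    return A - C * shift, B, C

for n_years in (4, 5, 7, 12, 25):
    print(n_years, is_orthogonal_set(n_years))
```

For a second-degree trend only the constant changes on conversion, since $p_1 = t$ and $p_2$ differs from $t^2$ by a constant.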

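Calculation of the Coefficients A, B, C, . . . . If the values of
$p_1$, $p_2$, $p_3$, . . . given in Eqs. (5) are substituted in the
formulas for $A$, $B$, $C$, etc. [Eqs. (2)], the following values
are obtained:

\[
\begin{aligned}
A &= \frac{\sum y}{N}\\
B &= \frac{12}{N(N^2-1)}\sum ty\\
C &= \frac{180}{N(N^2-1)(N^2-4)}\sum\left(t^2 - \frac{N^2-1}{12}\right)y\\
D &= \frac{2{,}800}{N(N^2-1)(N^2-4)(N^2-9)}\sum\left(t^3 - \frac{3N^2-7}{20}\,t\right)y\\
E &= \frac{44{,}100}{N(N^2-1)(N^2-4)(N^2-9)(N^2-16)}\sum\left(t^4 - \frac{3N^2-13}{14}\,t^2 + \frac{3(N^2-1)(N^2-9)}{560}\right)y\\
F &= \frac{698{,}544}{N(N^2-1)(N^2-4)(N^2-9)(N^2-16)(N^2-25)}\sum\left(t^5 - \frac{5(N^2-7)}{18}\,t^3 + \frac{15N^4-230N^2+407}{1{,}008}\,t\right)y
\end{aligned}
\tag{6}
\]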


In order to illustrate the algebraic procedure by which the
above formulas are obtained, the formula for $C$ will be derived,
as follows:

\[ C = \frac{\sum p_2y}{\sum p_2^2} \]

But

\[ p_2 = t^2 - \frac{N^2-1}{12} \]

Hence,

\[
C = \frac{\sum\left(t^2 - \dfrac{N^2-1}{12}\right)y}
{\sum t^4 - \dfrac{2(N^2-1)}{12}\sum t^2 + \dfrac{N(N^2-1)^2}{144}}
\]

The formula for $\sum t^2/N$, however, is $\dfrac{N^2-1}{12}$; hence

\[ \sum t^2 = \frac{N(N^2-1)}{12} \]

Likewise, the formula for $\sum t^4/N$ is
$\dfrac{N^2-1}{12}\cdot\dfrac{3N^2-7}{20}$; hence

\[ \sum t^4 = \frac{N(N^2-1)}{12}\cdot\frac{3N^2-7}{20} \]

Therefore the denominator of $C$ becomes

\[
\frac{N(N^2-1)}{12}\cdot\frac{3N^2-7}{20}
- \frac{2(N^2-1)}{12}\cdot\frac{N(N^2-1)}{12}
+ \frac{N(N^2-1)^2}{144}
\]

Taking $N(N^2-1)/12$ out of each term,

\[
\frac{N(N^2-1)}{12}\left[\frac{3N^2-7}{20} - \frac{2(N^2-1)}{12} + \frac{N^2-1}{12}\right]
\]

which readily reduces to

\[
\frac{N(N^2-1)}{12}\cdot\frac{N^2-4}{15} = \frac{N(N^2-1)(N^2-4)}{180}
\]

Thus $C$ has the formula given above. The formulas for the
other coefficients can be obtained in the same way.
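
The reduction of the denominator of C can be confirmed by direct summation. The sketch below (Python, not part of the original text; the function names are hypothetical) sums the powers of t over the N periods, forms the three-term expansion used in the derivation, and compares it with the closed form N(N^2 - 1)(N^2 - 4)/180.

```python
from fractions import Fraction

def power_sum(n_years, power):
    """Sum of t**power over t = T - mean(T) for n_years equally spaced periods."""
    middle = Fraction(n_years - 1, 2)
    return sum((Fraction(i) - middle) ** power for i in range(n_years))

def denominator_of_C(n_years):
    """Sum of p2^2, expanded as sum t^4 - 2k sum t^2 + N k^2, k = (N^2 - 1)/12."""
    k = Fraction(n_years ** 2 - 1, 12)
    return (power_sum(n_years, 4)
            - 2 * k * power_sum(n_years, 2)
            + n_years * k ** 2)

def closed_form(n_years):
    """N(N^2 - 1)(N^2 - 4)/180, the denominator reached in the text."""
    n = Fraction(n_years)
    return n * (n**2 - 1) * (n**2 - 4) / 180

for n_years in (3, 4, 11, 41):
    print(n_years, denominator_of_C(n_years), closed_form(n_years))
```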
Equations (6) could be applied by using a work sheet with
columns for the product terms indicated in order to obtain
$\sum p_1y$, $\sum p_2y$, etc. Greater economy is obtained, however,
by using the subtotal summation type of work sheet illustrated in
the preceding chapter. By using such a work sheet, an expeditious
method that involves only addition and is self-checking
has been evolved for finding $A$, $B$, $C$, . . . . A brief
description of this method, together with the mathematical analysis
that justifies its use, will now be given.

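$\alpha$ is defined as $\dfrac{\sum y}{N}$, so that

\[ N\alpha = S_1 \tag{7} \]

and $\alpha'$ is defined as equal to $\alpha$. Accordingly,

\[ A = \alpha' = \frac{\sum y}{N} = \frac{S_1}{N} \tag{i} \]

From Eqs. (14), Chap. XXI,

\[ S_2 = \frac{N+1}{2}\sum y - \sum ty \]

and, by the definition of $\alpha$,

\[ S_2 = \frac{N(N+1)}{2}\,\alpha - \sum ty \]

If $\beta$ is now defined as

\[ \beta = \frac{2}{N(N+1)}\,S_2 \tag{ii} \]

then

\[ \beta = \alpha - \frac{2}{N(N+1)}\sum ty \]

But $\sum ty = \sum p_1y$, since $p_1 = t$; and if $\beta'$ is
defined as $\beta' = 2\sum p_1y / N(N+1)$,

\[ \beta = \alpha - \beta' \]

and

\[ \beta' = \alpha - \beta \tag{iii} \]

Since $B = \dfrac{12\sum p_1y}{N(N^2-1)}$, it follows that

\[ B = \frac{6}{N-1}\,\beta' \tag{iv} \]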


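Again, from Eqs. (14), Chap. XXI, it is found that

\[ 2S_3 = \sum t^2y - (N+2)\sum ty + \frac{(N+1)(N+3)}{4}\sum y \]

in which $\alpha$ and $\beta'$ may be substituted for equivalents,
so that

\[ 2S_3 = \sum t^2y - \frac{N(N+1)(N+2)}{2}\,\beta' + \frac{N(N+1)(N+3)}{4}\,\alpha \]

But

\[ p_2 = t^2 - \frac{N^2-1}{12} \]

Hence

\[ \sum t^2y = \sum p_2y + \frac{N(N^2-1)}{12}\,\alpha \]

Therefore, making substitutions in the above value of $2S_3$, and
writing $\alpha - \beta$ for $\beta'$,

\[
\begin{aligned}
2S_3 &= \frac{N(N+1)(N+3)}{4}\,\alpha + \frac{N(N^2-1)}{12}\,\alpha
 - \frac{N(N+1)(N+2)}{2}(\alpha - \beta) + \sum p_2y\\
&= \frac{3N(N+1)(N+3) + N(N+1)(N-1) - 6N(N+1)(N+2)}{12}\,\alpha
 + \frac{N(N+1)(N+2)}{2}\,\beta + \sum p_2y\\
&= -\frac{N(N+1)(N+2)}{6}\,\alpha + \frac{N(N+1)(N+2)}{2}\,\beta + \sum p_2y
\end{aligned}
\]

Now, if $\gamma$ is defined as

\[ \gamma = \frac{6}{N(N+1)(N+2)}\,S_3 \tag{v} \]

then

\[
\frac{N(N+1)(N+2)}{3}\,\gamma = -\frac{N(N+1)(N+2)}{6}\,\alpha
 + \frac{N(N+1)(N+2)}{2}\,\beta + \sum p_2y
\]

And if $\gamma'$ is defined as
$\gamma' = \dfrac{6}{N(N+1)(N+2)}\sum p_2y$, then

\[ 2\gamma = -\alpha + 3\beta + \gamma' \]

and

\[ \gamma' = \alpha - 3\beta + 2\gamma \tag{vi} \]

and since $C = \dfrac{180}{N(N^2-1)(N^2-4)}\sum p_2y$, it follows
that

\[ C = \frac{30}{(N-1)(N-2)}\,\gamma' \tag{vii} \]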

In the same manner, it can be shown that if

\[ \delta = \frac{24}{N(N+1)(N+2)(N+3)}\,S_4 \tag{viii} \]

and

\[ \delta' = \frac{20}{N(N+1)(N+2)(N+3)}\sum p_3y \]

then

\[ \delta' = \alpha - 6\beta + 10\gamma - 5\delta \tag{ix} \]

\[ D = \frac{140}{(N-1)(N-2)(N-3)}\,\delta' \tag{x} \]
As a result of the above analysis, from Eqs. (i), (ii), (v), and
(viii), the following formulas are obtained:

\[
\begin{aligned}
\alpha &= \frac{1}{N}\,S_1\\
\beta &= \frac{2}{N(N+1)}\,S_2\\
\gamma &= \frac{6}{N(N+1)(N+2)}\,S_3\\
\delta &= \frac{24}{N(N+1)(N+2)(N+3)}\,S_4\\
\epsilon &= \frac{120}{N(N+1)(N+2)(N+3)(N+4)}\,S_5\\
\lambda &= \frac{720}{N(N+1)(N+2)(N+3)(N+4)(N+5)}\,S_6
\end{aligned}
\tag{8}
\]


The values of $\epsilon$ and of $\lambda$ are indicated by
extension, since the symmetrical pattern of these formulas is
readily apparent. The numerators run 1!, 2!, 3!, 4!, 5!, 6!, etc.,
and the denominators run $N$, $N(N+1)$, $N(N+1)(N+2)$,
$N(N+1)(N+2)(N+3)$, etc.

Similarly, from Eqs. (i), (iii), (vi), and (ix), the following
formulas are obtained:*

\[
\begin{aligned}
\alpha' &= \alpha\\
\beta' &= \alpha - \beta\\
\gamma' &= \alpha - 3\beta + 2\gamma\\
\delta' &= \alpha - 6\beta + 10\gamma - 5\delta\\
\epsilon' &= \alpha - 10\beta + 30\gamma - 35\delta + 14\epsilon\\
\lambda' &= \alpha - 15\beta + 70\gamma - 140\delta + 126\epsilon - 42\lambda
\end{aligned}
\tag{9}
\]

and from Eqs. (i), (iv), (vii), and (x), the following formulas are
obtained:

* For additional equations, see Fisher, op. cit., or George W. Snedecor,
Statistical Methods (1940), pp. 324-334, where the procedure is applied to
problems of curvilinear correlation in which probability interpretation is
valid.
\[
\begin{aligned}
A &= \alpha'\\
B &= \frac{6}{N-1}\,\beta'\\
C &= \frac{30}{(N-1)(N-2)}\,\gamma'\\
D &= \frac{140}{(N-1)(N-2)(N-3)}\,\delta'\\
E &= \frac{630}{(N-1)(N-2)(N-3)(N-4)}\,\epsilon'\\
F &= \frac{2{,}772}{(N-1)(N-2)(N-3)(N-4)(N-5)}\,\lambda'
\end{aligned}
\tag{10}
\]


Tables to Be Used in Orthogonal-polynomial Analysis to Save
Calculations. All the explanation necessary for the application
of the method of orthogonal polynomials to a problem has been
given. Thus, from a work sheet providing the series of sums
$S_1$, $S_2$, $S_3$, . . . , Eqs. (8) could be used to find the
series $\alpha$, $\beta$, $\gamma$, $\delta$, . . . ; from these,
Eqs. (9) could be used to find the series $\alpha'$, $\beta'$,
$\gamma'$, $\delta'$, . . . ; from these, Eqs. (10) could be used to
find the series $A$, $B$, $C$, . . . . The set of orthogonal
polynomials fitting the data according to the least-squares
criterion could then be written
$y' = A + Bp_1 + Cp_2 + Dp_3 + \ldots$. From Eqs. (5), values of
$p_1$, $p_2$, $p_3$ in terms of $t$ could then be substituted, and
the final equation of trend in terms of $t$ would be found. But it
is desirable to effect another economy, by use of three tables of
values that are the same for all problems having
the same number of years of data.
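
The chain from the sums S to the coefficients A, B, C, D can be sketched in a few lines. The following Python example is an illustration only and is not part of the original text. It assumes that each S is the grand total of a running-total column cumulated from the first year downward, which is the convention that reproduces the signs of Eqs. (9); the work-sheet layout of Chap. XXI is not shown here, so that convention, the function names, and the eleven-year series are all assumptions. The result is compared with a direct application of Eqs. (5) and (6).

```python
from fractions import Fraction
from itertools import accumulate
from math import factorial

def summation_totals(y_values, count):
    """S1, S2, ...: each is the grand total of the previous running-total column."""
    column = list(y_values)
    totals = []
    for _ in range(count):
        totals.append(sum(column))
        column = list(accumulate(column))  # running totals from the first year down
    return totals

EQ9_WEIGHTS = [          # rows of Eqs. (9), through the third degree
    [1],
    [1, -1],
    [1, -3, 2],
    [1, -6, 10, -5],
]
EQ10_NUMERATORS = [1, 6, 30, 140]

def coefficients_by_summation(y_values):
    """A, B, C, D by Eqs. (8), (9), and (10)."""
    n = len(y_values)
    greek = []
    rising = Fraction(1)
    for k, total in enumerate(summation_totals(y_values, 4), start=1):
        rising *= n + k - 1                  # N(N+1)...(N+k-1)
        greek.append(factorial(k) * total / rising)
    primed = [sum(w * g for w, g in zip(row, greek)) for row in EQ9_WEIGHTS]
    result = []
    falling = Fraction(1)
    for k, (numerator, value) in enumerate(zip(EQ10_NUMERATORS, primed)):
        if k > 0:
            falling *= n - k                 # (N-1)(N-2)...(N-k)
        result.append(numerator * value / falling)
    return result

def coefficients_direct(y_values):
    """A, B, C, D as sum(p*y)/sum(p*p), with p1, p2, p3 from Eqs. (5)."""
    n = Fraction(len(y_values))
    middle = (n - 1) / 2
    times = [i - middle for i in range(len(y_values))]
    p1 = times
    p2 = [t**2 - (n**2 - 1) / 12 for t in times]
    p3 = [t**3 - (3 * n**2 - 7) / 20 * t for t in times]
    coefficients = [Fraction(sum(y_values)) / n]
    for p in (p1, p2, p3):
        coefficients.append(sum(a * b for a, b in zip(p, y_values))
                            / sum(a * a for a in p))
    return coefficients

y = [12, 15, 14, 19, 23, 22, 30, 29, 35, 41, 40]   # invented 11-year series
print([str(c) for c in coefficients_by_summation(y)])
print([str(c) for c in coefficients_direct(y)])
```

Only additions are needed to build the sums; the multiplications are confined to the constants, which is why tables of those constants save further labor.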

Thus, the use of Eqs. (8) will be greatly facilitated by the use
of Table 93, a set of constants, $\dfrac{N(N+1)}{2}$,
$\dfrac{N(N+1)(N+2)}{6}$, etc., worked out for various odd values of
$N$, that is to say, for various numbers of years, from 11 to 41.
The use of Eqs. (10) will be greatly facilitated by referring to
Table 94 for the various values of the series of constants
$\dfrac{N-1}{6}$, $\dfrac{(N-1)(N-2)}{30}$,
$\dfrac{(N-1)(N-2)(N-3)}{140}$, etc. And the use of Eqs. (5) will
be made easier by referring to Table 95 for the values of
$\dfrac{3N^2-7}{20}$, $\dfrac{3N^2-13}{14}$, etc.

Table 93.—Values of Specified Variables Dependent upon the Number of Years Included in Trend Calculation

Odd numbers of years, from 11 to 41

Table 95.—Values of Specified Variables Dependent upon the Number of Years Included in Trend Calculation

Odd numbers of years, from 11 to
Advantages of Method of Orthogonal Polynomials. The method
of orthogonal polynomials is a great timesaver whenever a trend
of higher order than a second-degree polynomial is fitted. While
it has required several pages to describe the method, it will be
noted that the actual solution of a problem requires little more
than a page of figures besides the work sheet. This is
illustrated in Chap. XXIV.

But the saving of time is not the sole advantage of the method
of orthogonal polynomials. In addition, the set of orthogonal
polynomials that is obtained when values for $A$, $B$, $C$, $D$,
. . . , are obtained, that is to say,

\[ y' = A + Bp_1 + Cp_2 + Dp_3 + \cdots \]

constitutes the solution for any one of several trend lines.
Thus $y' = A + Bp_1$ is the straight-line trend; the addition of
$Cp_2$ gives the second-degree polynomial trend; the addition of
$Dp_3$ gives the third-degree polynomial trend, etc. It is not
necessary to recalculate values for $A$, $B$, $C$, . . . , for the
various trends required. If a problem has been worked out to include
solutions for $A$, $B$, $C$, and $D$ and subsequently it is decided
that $E$ is required, it can be found by adding one more column to
the work sheet and finding the value of $E$ without recalculating
the values of $A$, $B$, $C$, and $D$.

This convenience of obtaining several types of trends from
one orthogonal set comes from the fact that the terms of the
orthogonal equation are linearly uncorrelated with each other.¹

1 See p. 600. 



CHAPTER XXIII 

TIME-SERIES ANALYSIS—SEASONAL VARIATION 

Historical Background. The second major stimulus to the 
development of methods for analyzing time series, listed at the 
beginning of Chap. XX, was the troublesome effects of seasonal 
variations in economic activity. Writers on labor problems 
stress the evil effects for labor of wide seasonal fluctuations in 
some employments. The effects of seasonal variations upon the 
banking and credit system were emphasized during the nineteenth 
century and the early part of the twentieth century. Even as 
early as 1793, Alexander Hamilton advised that redemption of 
the public debt be carried on during the winter, for, said he, 
“it is a familiar fact that during the winter in this country, there
is always a scarcity of money in the towns, a circumstance
calculated to damp the price of stock.”¹

Jevons made an analysis of the effects of the “autumnal
pressure” on the London money market and calculated the average
monthly fluctuations in currency movement between the Bank
of England and its branches (1855-1862) and the average
monthly excess of payments or receipts of British coin at the
Bank of England for the same period.² In 1890, George Clare
analyzed the seasonal variations for the period from 1881 to
1890 in the circulation of the Bank of England, in public deposits,
in “other deposits,” in “other securities,” in the “reserve,” and
in the “internal gold movements.”³ In 1902, J. P. Norton
published a study of the New York money market in which he computed

1 28th Congress, 1st Session, Executive Document, 15, p. 199. Cf. Myers,
Margaret G., The New York Money Market, Vol. 1, Origins and
Development, p. 208. Other early references to seasonal fluctuations are Hunt's
Merchants' Magazine, Vol. 20, p. 302, Vol. 39, p. 582; Journal of Commerce,
Aug. 3, 1846.

2 Investigations in Currency and Finance (Foxwell ed., 1909), pp. 158-159.
Cf. Mitchell, W. C., Business Cycles, The Problem Stated and Its Setting
(1928), pp. 199, 236.

3 A Money-Market Primer (2d ed.), pp. 19, 24, 31, 42, 53, 55. Cf. Smith,
James G., Benjamin H. Beckhart, and William A. Brown, The New York
Money Market, Vol. 4, External and Internal Relations, p. 424.

the seasonal variation in loans, the peaks occurring on
Mar. 4 and in July and December and the low points occurring
at the beginning of the year, in May, and at the end of
November.¹ The outstanding statistical analysis of seasonal variations
in the New York money market before the First World War is
that prepared for the National Monetary Commission in 1910
by Prof. E. W. Kemmerer.² In this study he analyzed seasonal
variations in money rates, exchange rates, bond yields, currency
movements, and deposits. His analysis brought out the
seasonal relationships in a striking manner, in spite of very strict
limitations in available data at the time. Much of his work is
based upon data gathered by the questionnaire method.

Causes of Seasonal Variation. Two types of underlying forces 
cause seasonal variations in economic activity: (1) climatic con¬ 
ditions giving rise to seasons in agricultural production, in out- 
of-door construction work, in the manufacture of clothing, in 
the use of fuel, and in traveling, etc., and (2) forces arising from 
convention, such as the Christmas and Easter trade and
seasonal style convention.³ The effects of these various basic
seasonal influences upon the New York money market and upon 
the banking and credit structure of the United States have 
recently been exhaustively studied and published in Vol. 4 of the 
previously mentioned studies of The New York Money Market, 
edited by Prof. Benjamin H. Beckhart of Columbia University.⁴

In large part the movement for banking reform in this country, 
which culminated in the studies of the National Monetary Com¬ 
mission and the Federal Reserve Act of 1913, was the result of 
the evil effects of seasonal fluctuations in the demands of trade 
giving rise to periodical stringencies in the money market and fre¬ 
quently initiating monetary panics. Consequently, it was one of 
the most important aims of the Federal Reserve System to devise 
an elastic currency and credit system that would accommodate 
these seasonal demands.⁵ Thus banking reform in the United

1 Statistical Studies in the New York Money Market, pp. 62-64. 

2 Seasonal Variations in the Relative Demand for Money and Capital in 
the United States (National Monetary Commission Publications), Vol. 22. 

3 Cf. Mitchell, op. cit., pp. 236-240. 

4 The New York Money Market, Vol. 4, External and Internal Relations,
pp. 417-542.

5 The New York Money Market, Vol. 2, Sources and Movements of Funds,
pp. 165-374.





States is a case in which a long-recognized evil was finally statisti¬ 
cally measured and evaluated and a reform in the system definitely 
resulted in improvement. 

Not only in the field of banking has the study of seasonal 
variation by statistical methods been stimulated. In addition, 
unemployment with all its economic, social, and psychological 
implications has aroused great concern about the measurement of 
such variation. Extended reference to the problem of seasonal
unemployment was made at former President Hoover's
Conference on Unemployment, in the Report and Recommendations
of the Committee to Investigate Business Cycles and
Unemployment.¹

In the hearings before the Committee on Education and
Labor, of the United States Senate, in 1928-1929, much material
and discussion are devoted to the subject of the seasonal
variations in employment in industries and trade.² Franklin D.
Roosevelt, when governor of New York State, appointed a
Committee on the Stabilization of Industry for the Prevention of
Unemployment, which made its report to him in November,
1930, entitled Less Unemployment through Stabilization of
Operations, in which the subject of seasonal variations in
employment constituted an important part.

During the years leading up to the depression of the 1930's,
much was written on seasonal variation in employment and its
contemplated stabilization. Thereafter, the problem of cyclical
unemployment and its solution by means of unemployment
insurance and the entire social security program dominated the
scene.³

1 New York, 1923, pp. 6, 116-120, 161, 215.

2 70th Congress, 2d Session, “Unemployment in the United States,”
S.R.219.

3 Smith, Edwin S., Reducing Seasonal Unemployment, The Experience of
American Manufacturing Concerns (1931). Douglas, Paul H., and Aaron
Director, The Problem of Unemployment. This book devotes pp. 73-118
to the subject of seasonal variations and regularization of industry to
stabilize such fluctuations. Hansen, Alvin H., and Tillman M. Sogge,
Seasonal Irregularity of Employment in Minneapolis, St. Paul and Duluth
(Employment Stabilization Research Institute, November, 1931).
Berridge, W. A., “Employment and Income of Labor in the United States,”
in International Unemployment (a study of fluctuations in employment and
unemployment in several countries, 1910-1930, Industrial Relations
Institute, The Hague, Netherlands, 1932).
A familiar example of seasonal activity in the economic sphere
is construction activity, which gives not more than two-thirds 
as much employment in the winter months, on the average, as 
in the summer. Some important manufacturing industries, too, 
such as the automobile, agricultural implements, and ready-made 
clothing industries, show a considerable seasonal fluctuation. 
To be sure, the busy season in some industries comes in the dull 
season for others, a fact that tends to level out the differences 
between the number employed in industry in its entirety in 
one month as compared with another. But this does not mean 
that the workers released by one industry are absorbed by 
another to a sufficient degree or with sufficient promptitude to 
obliterate the variations from month to month in the amount 
of their employment. Barriers of specialized skill, geography, 
and attachment to particular occupations and localities prevent 
anything like the dovetailing suggested by the figures of the 
total number employed.¹ Consequently, the statistics of total
employment may show little seasonal variation, while at the 
same time large degrees of seasonal unemployment exist in many 
parts of the total. The fact that there is no seasonal variation 
or little seasonal variation in total employment does not solve the 
unemployment problem for the seasonally unemployed worker. 

One reason why concern, statistically speaking, about the 
subject of seasonal variations in employment has been stimulated 
is because the opinion prevails that this particular type of 
unemployment is in large part avoidable. The movement to 
inaugurate unemployment insurance in the United States was 
partly based upon the belief that such a measure for the relief 
of unemployment would tend to regularize industries affected 
by seasonal unemployment. It is recognized that the greater 
problem of cyclical unemployment is less easily solved. The 
literature on the subject of unemployment insurance in the 
United States makes it clear that the movement is directed
particularly toward the regularization of industry to eliminate as
much as possible of the seasonal fluctuation in employment.²

With these problems in mind, students of the labor problem 
asked: What types of business are responsible for the largest 

1 McCabe, David A., chapter on Unemployment, in Facing the Facts (a
symposium, 1932), pp. 324-325, 338-351.

2 Cf. McCabe, op. cit., pp. 344-346, 350.

part of this seasonal irregularity in employment? What are the
peak and slack seasons of employment in different businesses, and 
what are the amplitudes of fluctuations? What can be done to 
make business less seasonal and less irregular? What is the cost 
of regularization plans in an industry, and how do such costs com¬ 
pare with the savings resulting from more regular use of capital 
investment? These are the types of exceedingly practical prob¬ 
lems presenting themselves in this field of economics, and they 
have stimulated statistical research to take measurements of 
seasonal variations. They are of practical significance to 
employers and to investors and to workers. They are of great 
social and psychological significance to the social scientist, the 
economist, and the political theorist. 

Methods of Measuring Seasonal Variations 

It has been seen that the method of discovering trends either 
for their own sake (rational trends) or in order to remove them 
from the data, i.e., to get rid of them (empirical trends), has been 
based upon curve-fitting technique. The technical problem 
involved is a simple one even though the mathematics may be 
complex in some cases. The simplicity of the idea is somewhat 
offset, however, by the irrational character of the procedure. 
This is a troublesome factor because it is the function of the 
statistician not only to apply mathematical analysis to statistics 
but also to explain what he does and why he does it. Enough 
has been included in Chaps. XX and XXI, to indicate the 
general character of this problem. 

In the case of seasonal variation, the difficulties of the statis¬ 
tician are just the reverse; for while it has been possible to build 
up a perfectly rational procedure, upon the basis of the theory of 
averages, the technical problem involved has been found to be 
a complex one. The rational concept underlying the procedure of 
measuring seasonal variation is that, where a time series has a 
characteristic seasonal variation occurring year after year, it 
should be quite reasonable to depict a “typical,” or average,
seasonal variation for that time series. 

In its abstract aspect, therefore, the concept is perfectly 
rational. Homogeneous variates are to be averaged to obtain 
a type. For example, it is proposed to average the amount 
by which January data are higher or lower than those of other 
months of the year, to average the amount by which February
data are higher or lower than those of other months of the
year, and so on, until a picture of the “type” of periodical
movement that occurs every year is obtained, although each year may
be slightly different from the type. Moderate variations from
the type are quite consonant with the theory of averages and
their application to the problem of measuring seasonal variation.¹

When the rational procedure is to be put into effect, however, 
difficulties of a technical character arise. A time series of raw 
data that by a priori knowledge should have a distinctively 
regular seasonal variation may be selected. A graph of the time 
series is made, and a seasonal variation occurring every year is 
revealed, but the seasonal periodicity in the raw data is distorted 
by other movements, namely, trend and cycle. This was noted 
at the beginning of Chap. XX where a hypothetical time series 
was constructed. It is clear that the data in their raw state 
cannot be averaged to find the typical seasonal periodicity. 
That is to say, January, 1937, is not homogeneous with respect 
to seasonal variation with January, 1940, because the relative 
position of the respective Januaries (1) as to trend and (2) as to 
cycles is not comparable. In other words, averaging the raw 
data of all the Januaries in a series of data, all the Februaries, 
etc., for the 12 months of the year would be an irrational
procedure. This would not accord with the rational idea of seasonal
variation outlined above because the averages of raw data would 
include averages of something in addition to seasonal variation. 

Problem of Isolating Seasonal Variation. To average the
actual seasonal variation, it must be isolated from the other 
types of variation in the raw data. The technical problem 
involved in the measurement of seasonal variation is thus how to 
isolate from the raw data that part of its fluctuation that is 
essentially seasonal in character. When these other types of 

1 If the seasonal variation were measured weekly, rather than monthly, 
the principle would be the same. The same principle may be used to 
measure periodicity by days within the month or within the week; and it 
may likewise be used to measure periodicity by hours within the day. 
Thus periodicity by days within the month of wage payments might have 
great economic value for some problems; and periodicity by hours within 
the day of consumption of electrical power might have significance in
connection with some problems. Seasonal variation is only one type of
periodicity that can be measured by this method.

fluctuation have been removed from the raw data, by subtraction
or division, all the residual Januaries can be rationally averaged,
all the residual Februaries can be averaged, etc., in order to
obtain a picture of the average position, respectively, of each 
month. 

How can this be done? There are several answers to this 
question, and there is controversy as to just what is the best 
technical procedure. In his notable studies of seasonal variation 
made about 1910, Prof. Kemmerer devised a method for measur¬ 
ing seasonal variation separate from other fluctuations. At the 
time, it was the best that had been suggested.¹

Another famous suggestion as to a method of isolating the 
seasonal periodicities from other types of fluctuations was made 
by W. M. Persons, when from 1915 to 1919 he developed his 
approach to the problems of time-series analysis, culminating 
in the establishment of the Harvard Economic Society's business
barometer and the Review of Economic Statistics. Persons'
method, called the “link relative method,” expresses each
monthly figure as a relative of the immediately preceding month;
the seasonal pattern is found by averaging all the link relatives
for the same month and taking any residual trend out of the chain
relatives computed from these average link relatives.²

A third method of isolating the seasonal fluctuations and 
measuring them by an index of seasonal variation is that
advocating simply the removal of trend from the data and then the
averaging of the monthly ratio differences from the trend.³
While this method removes the nonhomogeneous effects of trend, 
it does not remove those due to cyclical fluctuations. If taken 
over a sufficient period of time, the bias of the cyclical fluctuations 
will cancel so that a true index of seasonal variations would be 

1 Op. cit. Cf. criticism of Kemmerer's method by W. L. Hart, Journal of
the American Statistical Association, Vol. 17 (1922). Kemmerer's work
constitutes an important pioneer effort to solve the technical difficulties
involved and helped direct attention to better solutions.

2 Review of Economic Statistics, January, 1919, pp. 18-31; Indices of
Business Conditions (1919). Cf. Rietz, H. L., Handbook of Mathematical
Statistics, pp. 151-155.

3 Falkner, Helen D., “The Measurement of Seasonal Variation,”
Journal of the American Statistical Association, Vol. 19 (1924), pp. 167-179;
Robb, Richard A., “Variate Difference Method of Seasonal Variation,”
ibid., Vol. 24 (1929), pp. 250-257.

obtained by averaging. One of the discoveries of recent years,
however, is that seasonal variations change in the same time 
series from one era to another owing to new conditions; for such 
time series an index of seasonal variation based on a period of 
time covering two or more eras would be comparatively useless. 
Thus the criticism of the ratio-difference-from-the-trend method 
is that, if taken over a sufficient period of time to make it a 
valid measurement of seasonal variation, it would be taken for 
too long a time, i.e., that two or more eras of typical seasonal 
fluctuation might be confused. 

A number of other methods have been suggested, based
upon the principles that have been outlined.¹ The most widely
used and probably the best method is the 12 months' moving
average method, of which a number of refinements have been
suggested. Since this method is the one most extensively used,
it is now described in detail, and an illustration is given.

Twelve Months' Moving Average Method. This method
consists of the following steps: 

1. Calculate a 12 months' moving average of the raw data, 
centering the moving average at the seventh month; thus, 
opposite July of the first year would be the average of the 
12 months of that year; opposite August would be the average 
of the last 11 months of that year and the first month of the next 
year; and so on. 

2. Divide the raw data serially by the 12 months' moving 
average. Inasmuch as the moving average would contain 
in it the elements both of trend and of major and minor cycles, 
the residuals of the raw data from the moving average (either 
by subtraction or division) would contain purely seasonal 
fluctuations. 

1 King, W. I., “An Improved Method for Measuring the Seasonal Factor,” 
Journal of the American Statistical Association, Vol. 19 (1924), pp. 
301-313; Carmichael, F. L., “Methods of Computing Seasonal Indexes: 
Constant and Progressive,” ibid., Vol. 22 (1927), pp. 339-354; Joy, Aryness, 
and Woodlief Thomas, “Use of Moving Averages in the Measurement of 
Seasonal Variation,” ibid., Vol. 23 (1928), pp. 241-252; Baumann, A. O., 
“Thirteen Months-Ratio-First Difference Method of Measuring Seasonal 
Variation,” ibid., Vol. 23 (1928), pp. 282-290; Kuznets, Simon, “Seasonal 
Patterns and Seasonal Amplitudes: Measurement of Their Short-time 
Variations,” ibid., Vol. 27 (1932), pp. 9-20; Riggleman, John R., and Ira 
N. Frisbee, Business Statistics (1932), pp. 226-242. 
3. Make frequency distributions of the several months (see 
the example at the end of the chapter). 

4. Find the median relatives for each month, using the median 
to avoid the influence of extreme fluctuations. 

5. Express these median relatives as a percentage of their own 
average, thus giving an index of seasonal variation. 

As a short cut, inasmuch as the result will be precisely the 
same, the 12 months' moving total may be used instead of the 
12 months' moving average, thus saving the division throughout 
by 12. 
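
The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' work-sheet procedure: the function names `moving_totals` and `seasonal_index` are hypothetical, the moving total is used in place of the moving average (the short cut just described), and the sample figures are the 1929 and 1930 entries of column (1) of Table 96. At least two full years of data are assumed, so that every calendar month has at least one ratio.

```python
from statistics import median

def moving_totals(data, window=12):
    """Map each position to the 12 months' moving total centered there.

    The total of data[i:i+12] is written opposite the seventh month,
    that is, at index i + 6.
    """
    return {i + 6: sum(data[i:i + window])
            for i in range(len(data) - window + 1)}

def seasonal_index(data, first_month=0):
    """Return 12 index numbers (January first) that average 100.

    data        -- monthly values in time order
    first_month -- calendar month of data[0], with 0 = January
    """
    ratios = {m: [] for m in range(12)}
    for pos, total in moving_totals(data).items():
        # Step 2: raw datum as a percentage of its moving total.
        ratios[(first_month + pos) % 12].append(100 * data[pos] / total)
    # Steps 3 and 4: one array per calendar month, summarized by its median.
    medians = [median(ratios[m]) for m in range(12)]
    # Step 5: medians expressed as percentages of their own average.
    average = sum(medians) / 12
    return [100 * m / average for m in medians]

# Column (1) of Table 96 for 1929 and 1930 (millions of dollars).
DEBT = [207, 199, 199, 217, 237, 260, 273, 274, 272, 266, 261, 265,
        249, 233, 224, 229, 237, 248, 252, 248, 241, 232, 223, 222]

totals = moving_totals(DEBT)
index = seasonal_index(DEBT)
```

With only two years of data each monthly array holds one or two ratios, so the resulting index is merely a demonstration of the arithmetic; the median becomes meaningful only when many years are tabulated.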

Problem Illustrating Measurement of Seasonal Variation. 

Calculating the Index of Seasonal Variation by the 12 Months' 
Moving Average Method. The time series of monthly data on 
consumer installment-sale debt for household appliances in the 
United States has been selected to illustrate the calculation of an 
index of seasonal variation by the 12 months' moving average 
method. Table 96 is a work sheet for the calculations necessary 
to the problem. The data were recorded on this work sheet for 
the years 1929-1942 by months, the raw data appearing in 
column (1). Next, a 12 months' moving total was calculated; 
this appears in column (2), the moving total being “centered at 
the seventh month.” For example, the figure 2,930 after July, 
1929, in column (2) of the work sheet is the total of the 12 
monthly figures for 1929; the figure 2,972 (opposite August, 
1929) is the total of the next 12 monthly figures, beginning with 
February, 1929, and ending with January, 1930. Opposite each 
July is the total for that year; this constitutes a good cross 
check in the construction of the moving total. 

To calculate the moving total, first put the 12 monthly figures 
for 1929 in the adding machine, and take a subtotal; then subtract 
the datum for January, 1929, and add the datum for 
January, 1930, and take a subtotal; then subtract the datum for 
February, 1929, and add the datum for February, 1930, and 
take a subtotal; and so on, until the end of the time series. 
Clear the machine, and then add independently the last 12 
months of the time series; this should check with your last 
subtotal. If it does not check, a mistake has been made, which 
can be most readily found by checking up on the July subtotals 
for each year, beginning with the last one and going back until 
you find the mistake. These subtotals are the 12 months' 
Table 96.—Work Sheet for Calculating Index of Seasonal Variation 
Data: Consumer installment-sale debt, monthly, for household appliances, 
end of month 
(In millions of dollars) 

                       (1)          (2)               (3)
Year and month       Monthly     12 months'      Raw data divided
                     raw data    moving total    by 12 months'
                                 centered at     moving total,
                                 7th month       per cent

1929:
  January              207
  February             199
  March                199
  April                217
  May                  237
  June                 260
  July                 273         2,930             9.32
  August               274         2,972             9.22
  September            272         3,006             9.05
  October              266         3,031             8.78
  November             261         3,043             8.58
  December             265         3,043             8.71
1930:
  January              249         3,031             8.22
  February             233         3,010             7.74
  March                224         2,984             7.51
  April                229         2,953             7.75
  May                  237         2,919             8.12
  June                 248         2,881             8.61
  July                 252         2,838             8.88
  August               248         2,800             8.86
  September            241         2,766             8.71
  October              232         2,734             8.48
  November             223         2,701             8.26
  December             222         2,666             8.33
1931:
  January              211         2,628             8.03
  February             199         2,588             7.69
  March                192         2,548             7.54
  April                196         2,509             7.81
  May                  202         2,471             8.17
  June                 210         2,434             8.63
  July                 212         2,397             8.84
  August               208         2,357             8.82
  September            202         2,318             8.71
  October              194         2,275             8.53
  November             186         2,225             8.36
  December             185         2,167             8.54

Table 96.—Work Sheet for Calculating Index of Seasonal Variation.—(Continued) 

                       (1)          (2)               (3)
Year and month       Monthly     12 months'      Raw data divided
                     raw data    moving total    by 12 months'
                                 centered at     moving total,
                                 7th month       per cent

1932:
  January              171         2,101             8.14
  February             160         2,028             7.89
  March                149         1,954             7.63
  April                146         1,881             7.76
  May                  144         1,813             7.94
  June                 144         1,749             8.23
  July                 139         1,685             8.25
  August               134         1,628             8.23
  September            129         1,575             8.19
  October              126         1,528             8.25
  November             122         1,484             8.22
  December             121         1,447             8.36
1933:
  January              114         1,418             8.04
  February             107         1,398             7.65
  March                102         1,386             7.36
  April                102         1,378             7.40
  May                  107         1,372             7.80
  June                 115         1,367             8.41
  July                 119         1,365             8.72
  August               122         1,364             8.94
  September            121         1,365             8.86
  October              120         1,370             8.76
  November             117         1,384             8.45
  December             119         1,403             8.48
1934:
  January              113         1,422             7.95
  February             108         1,441             7.49
  March                107         1,456             7.35
  April                116         1,468             7.90
  May                  126         1,479             8.52
  June                 134         1,490             8.99
  July                 138         1,502             9.19
  August               137         1,515             9.04
  September            133         1,528             8.70
  October              131         1,544             8.48
  November             128         1,561             8.20
  December             131         1,578             8.30
Table 96.—Work Sheet for Calculating Index of Seasonal Variation.—(Continued) 

                       (1)          (2)               (3)
Year and month       Monthly     12 months'      Raw data divided
                     raw data    moving total    by 12 months'
                                 centered at     moving total,
                                 7th month       per cent

1935:
  January              126         1,600             7.88
  February             121         1,626             7.44
  March                123         1,657             7.42
  April                133         1,692             7.86
  May                  143         1,728             8.28
  June                 156         1,768             8.82
  July                 164         1,808             9.07
  August               168         1,845             9.10
  September            168         1,882             8.93
  October              167         1,921             8.69
  November             168         1,964             8.55
  December             171         2,017             8.48
1936:
  January              163         2,075             7.86
  February             158         2,143             7.37
  March                162         2,211             7.33
  April                176         2,281             7.72
  May                  196         2,353             8.33
  June                 214         2,428             8.81
  July                 232         2,512             9.24
  August               236         2,596             9.09
  September            238         2,683             8.87
  October              239         2,772             8.62
  November             243         2,863             8.49
  December             255         2,952             8.64
1937:
  January              247         3,045             8.11
  February             245         3,130             7.83
  March                251         3,217             7.80
  April                267         3,302             8.09
  May                  285         3,382             8.43
  June                 307         3,451             8.90
  July                 317         3,503             9.05
  August               323         3,552             9.09
  September            323         3,594             8.99
  October              319         3,624             8.80
  November             312         3,639             8.57
  December             307         3,636             8.44
Table 96.—Work Sheet for Calculating Index of Seasonal Variation.—(Continued) 

                       (1)          (2)               (3)
Year and month       Monthly     12 months'      Raw data divided
                     raw data    moving total    by 12 months'
                                 centered at     moving total,
                                 7th month       per cent

1938:
  January              296         3,610             8.20
  February             287         3,571             8.04
  March                281         3,525             7.97
  April                282         3,475             8.12
  May                  282         3,420             8.24
  June                 281         3,371             8.34
  July                 278         3,330             8.35
  August               277         3,290             8.42
  September            273         3,253             8.39
  October              264         3,218             8.20
  November             263         3,183             8.26
  December             266         3,154             8.43
1939:
  January              256         3,153             8.12
  February             250         3,120             8.01
  March                246         3,110             7.91
  April                247         3,103             7.96
  May                  253         3,104             8.15
  June                 260         3,106             8.37
  July                 265         3,113             8.51
  August               267         3,119             8.56
  September            266         3,124             8.51
  October              265         3,131             8.46
  November             265         3,143             8.43
  December             273         3,161             8.64
1940:
  January              262         3,182             8.23
  February             255         3,205             7.96
  March                253         3,232             7.83
  April                259         3,259             7.95
  May                  271         3,284             8.25
  June                 281         3,309             8.49
  July                 288         3,338             8.63
  August               294         3,366             8.73
  September            293         3,397             8.62
  October              290         3,430             8.45
  November             290         3,474             8.35
  December             302         3,523             8.57
Table 96.—Work Sheet for Calculating Index of Seasonal Variation.—(Continued) 

                       (1)          (2)               (3)
Year and month       Monthly     12 months'      Raw data divided
                     raw data    moving total    by 12 months'
                                 centered at     moving total,
                                 7th month       per cent

1941:
  January              290         3,572             8.12
  February             286         3,620             7.90
  March                286         3,672             7.79
  April                303         3,721             8.14
  May                  320         3,764             8.50
  June                 330         3,794             8.70
  July                 336         3,805             8.83
  August               346         3,809             9.08
  September            342         3,808             8.98
  October              333         3,794             8.78
  November             320         3,749             8.54
  December             313         3,670             8.53
1942:
  January              294         3,559             8.26
  February             285         3,425             8.32
  March                272         3,262             8.34
  April                258         3,089             8.35
  May                  241
  June                 219
  July                 202
  August               183
  September            169
  October
  November
  December

Source: Holthausen, Duncan McG., “Monthly Estimates of Short-term Consumer 
Debt, 1929-1942,” Survey of Current Business, Vol. 22 (November, 1942), pp. 9-25. 


moving total and can be tabulated in column (2) of the work 
sheet, as in Table 96. The next step is to divide each monthly 
raw datum by the corresponding moving total figure, expressing 
the answer as a percentage figure in column (3). The figures in 
column (3) are then tabulated in a system of frequency arrays 
as in Fig. 147. 
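
The adding-machine routine for the moving total (subtract the oldest month, add the newest, take a subtotal) and its independent cross check can be mirrored in code. The following is a minimal sketch under that reading of the procedure; the names `running_totals` and `last_total_checks` are hypothetical, and the data are the 1929 and 1930 entries of column (1) of Table 96.

```python
def running_totals(data, window=12):
    """Successive 12 months' totals by the subtract-and-add routine."""
    total = sum(data[:window])            # first subtotal: the first 12 months
    totals = [total]
    for i in range(window, len(data)):
        # Drop the month leaving the span and add the month entering it.
        total += data[i] - data[i - window]
        totals.append(total)
    return totals

def last_total_checks(data, totals, window=12):
    """Independent check: re-add the last 12 months from scratch."""
    return totals[-1] == sum(data[-window:])

# Column (1) of Table 96 for 1929 and 1930 (millions of dollars).
DEBT = [207, 199, 199, 217, 237, 260, 273, 274, 272, 266, 261, 265,
        249, 233, 224, 229, 237, 248, 252, 248, 241, 232, 223, 222]

totals = running_totals(DEBT)
ok = last_total_checks(DEBT, totals)
```

The first three subtotals are the entries opposite July, August, and September, 1929, in column (2), and the last is the entry opposite July, 1930.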

From Fig. 147 the median monthly relatives are read and 
arranged as in Table 97. 
Fig. 147.—Frequency arrays, one for each month, of distributions of monthly 
ratios of raw data to 12 months' moving total. Column (3) of Table 96. 
Consumer installment-sale debt for household appliances in the United States, 
1935-1942. 

Column (1) of Table 97 consists of the median relatives read 
from Fig. 147; and these median relatives have only to be 

Table 97.—Index of Seasonal Variation in Consumer Installment-sale 
Debt for Household Appliances in the United States 

Month              Medians      Index of seasonal
                                variation¹

January              8.13            96.2
February             7.95            94.0
March                7.82            92.5
April                8.05            95.2
May                  8.25            97.6
June                 8.72           103.2
July                 8.85           104.7
August               9.10           107.6
September            8.88           105.0
October              8.64           102.2
November             8.50           100.6
December             8.55           101.1

Total              101.44         1,200.0
Average              8.4533+

1 This column consists of the medians expressed as percentages of their average. Thus 
8.13 is 96.2 per cent of 8.4533+. 

expressed as percentages of their own average to give the index 
of seasonal variation. This is done, giving the figures in column 
tions and chance fluctuations. By averaging, the chance 
fluctuations are canceled out, leaving in the index a description 
of relative seasonal movement. The theory of this method is 
based, of course, upon the use of a 12 months' moving average; 
but precisely the same arithmetical results are obtained by using 
the moving total instead, and it is a saving of a considerable number 
of division processes. In dividing by the 12 months' moving 
total instead of by the 12 months' moving average, the average 
residual percentage is 8.333+, whereas, in dividing by the 
12 months' moving average, the average residual percentage 
would be 12 times 8.333+, or 100.00. 
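
The last step, and the claim that the moving total and the moving average lead to the same index, can be checked numerically. The sketch below is illustrative only: `to_index` is a hypothetical helper name, and the figures are the medians of Table 97. Multiplying every median by 12 (which is what dividing by the moving average instead of the moving total amounts to) leaves the index unchanged, because the factor cancels when the medians are divided by their own average.

```python
# Medians from Table 97, January through December.
MEDIANS = [8.13, 7.95, 7.82, 8.05, 8.25, 8.72,
           8.85, 9.10, 8.88, 8.64, 8.50, 8.55]

def to_index(values):
    """Express each value as a percentage of the average of all values."""
    average = sum(values) / len(values)
    return [round(100 * v / average, 1) for v in values]

index_from_totals = to_index(MEDIANS)
# Ratios to the moving average are 12 times the ratios to the moving total.
index_from_averages = to_index([12 * m for m in MEDIANS])
```

The rounded index numbers add to 1,199.9 rather than exactly 1,200.0, a discrepancy of rounding only.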

From the multiple frequency array, as in Figs. 147 and 148, it 
can be determined whether or not the seasonal variation is well 
defined. If the course of all the recorded ratios of raw data to 
the 12 months’ moving total by months tends to run close to the 
course of the medians, then the seasonal variation is a well- 
defined one. If, however, the points are scattered in a wide 
range from the medians and the general swing of the data does 
not correspond to the movements of the median line, then the 
seasonal variation is not well defined. Such a result might be 
obtained if the type of the seasonal variation were changing, and 
in that case the data may be studied in groups of a smaller number 
of years. Figure 148 is included to present examples of 
poorly defined seasonal variation, as compared with well-defined 
cases of seasonal variation. The data studied are commercial 
paper rates in the New York money market before and after 
the inception of the Federal Reserve System. From the figure 
it is seen how well defined the seasonal variation in commercial 
paper rates was before the beginning of the Federal Reserve 
System—namely, for the periods 1904-1909 and 1909-1914. 
Also, it is seen how poorly defined is the seasonal variation for 
the periods 1920-1925 and 1925-1930—so poorly that there 
could hardly be said to have been any consistent seasonal 
periodicity whatever.¹ 

METHOD OF DETECTING CHANGING SEASONAL VARIATION 

Figure 149 is drawn to discover whether or not, during the 
years from 1929 to 1941, the seasonal variation in consumer 
installment debt for household appliances has changed.² The 

1 For a more complete discussion see The New York Money Market, Vol. 
4, pp. 510-530. 

2 For other suggested methods of measuring changing seasonal variation 
see Julius Shiskin, “New Multiplicative Seasonal Index,” Journal of the 
American Statistical Association, Vol. 37 (1942), pp. 507-516; Henry A. 
Latané, “Seasonal Factors Determined by Difference from Average of 
Adjacent Months,” Journal of the American Statistical Association, Vol. 
37 (1942), pp. 517-522; Dudley J. Cowden, “Moving Seasonal Indexes,” 
Journal of the American Statistical Association, Vol. 37 (1942), pp. 523-524. 
figure is a plot of the ratios to 12 months' moving total shown 
in the last column of Table 96; a separate graph for each month 
has been drawn in Fig. 149. 

The straight-line trends for each month were drawn in “at 
sight”; they were not fitted by a mathematical method. Figure 
149 shows that the relative seasonal position of January, June, 
October, November, and December remained about the same 
during this period of years. But the relative seasonal amount of 
consumer installment debt was rising in the months of February, 
March, April, and May, while the relative seasonal quantity of 
consumer installment debt was declining in the months of 
July, August, and September. 

Consequently, a more refined index of seasonal variation 
than the average of a period of years such as that shown in 
Table 97 and Fig. 148 can be obtained from Fig. 149. In fact, 
since these trends exist, a different index of seasonal variation 
for each year is required. For 1942 this index of seasonal variation 
can be obtained as indicated in Table 98. 

Table 98.—Computation of Index of Seasonal Variation in Consumer 
Installment-sale Debt for Household Appliances, 1942 

Month              Ratios, read         Index of
                   from trend lines     seasonal
                   in Fig. 149          variation¹

January                8.10               96.4
February               8.09               96.2
March                  8.08               96.1
April                  8.15               97.0
May                    8.35               99.3
June                   8.60              102.3
July                   8.65              102.9
August                 8.75              104.1
September              8.65              102.9
October                8.54              101.6
November               8.40               99.9
December               8.50              101.1

Total                100.86            1,200.0
Average                8.405

1 Obtained by expressing ratios in the first column as percentages of their own average. 
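
The trend lines of Fig. 149 were drawn at sight, so the ratios in Table 98 cannot be reproduced exactly by calculation. As a rough numerical counterpart, and plainly a substitute technique rather than the authors' method, a least-squares straight line can be fitted to one month's ratios and read off at 1942. The sketch below does this for the January ratios in column (3) of Table 96; the function name `fit_line` is hypothetical.

```python
# January ratios to the 12 months' moving total, 1930-1942 (Table 96, col. 3).
YEARS = list(range(1930, 1943))
JANUARY = [8.22, 8.03, 8.14, 8.04, 7.95, 7.88, 7.86,
           8.11, 8.20, 8.12, 8.23, 8.12, 8.26]

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(YEARS, JANUARY)
january_1942 = intercept + slope * 1942   # trend value read at 1942
```

The fitted slope is close to zero, which agrees with the statement that the relative seasonal position of January remained about the same; the fitted 1942 value lies somewhat above the 8.10 read at sight for Table 98, as may be expected when a freehand line is replaced by a fitted one.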



















CHAPTER XXIV 

DETERMINATION OF CYCLE 


Usually it is desirable to have current figures on a monthly 
basis, and to know how actual experience compares with what 
should be expected for the season and with normal growth. 
Can we estimate our position in the business cycle from month 
to month? Annual data adjusted for trend and a picture of 
undistorted seasonal variation, illustrated in the preceding 
chapters, do not go quite far enough. It is often necessary 
to remove trend and seasonal variation from monthly data, in 
order to determine position in the cycle. 

Cycle Determined by Adjusting Monthly Data. When monthly, 
instead of annual, data are analyzed, the empirical trend may be 
found by setting up a work sheet similar to Table 83 or 86 (pages 
582, 588), depending upon the type of trend selected. The 
trend is then fitted by the method of least squares in a manner 
precisely similar to that demonstrated for annual data. Of 
course, if quite a number of years of monthly data are thus 
treated, the calculations become very extended, but the principle 
remains the same.¹ It is possible, however, to derive an approximation 
of the monthly trend equation from an annual trend 
equation. This is explained in the present chapter and may 
serve as an economizer of time in the analysis of monthly data. 

Determination of Cycle in Annual Data. While the purpose 
of this chapter is to present a method for measuring the cycle 
in monthly data, it may be noted at the start that even if the 
object of analysis is to determine cycle in monthly data it is 
desirable first to study the annual data. Not only is this true 
because the monthly trend may be easily estimated from the 
annual trend, but it is also desirable because the general character 

1 Where trends are calculated for monthly data, involving long series, 
convenient short-cut methods of calculation have been devised. Cf. Ross, 
F. A., “Formulae for Facilitating Computations in Time Series Analysis,” 
Journal of the American Statistical Association, Vol. 20 (1925), pp. 75-79; cf. 
also timesaving devices discussed in Chap. XXII. 
of the results on a monthly basis can be visualized from the 
analysis of the annual figures and the annual data can be analyzed 
with a much smaller amount of computation. The analysis 
on an annual basis will help judge the kind of analysis required 
for the monthly data, whether to use a straight-line trend or a 
second- or third-degree polynomial trend. In addition, the 
analysis on an annual basis will help to decide the significance 
of the respective trends. This will now be illustrated by making 
use of monthly and annual averages of monthly data on consumer 
installment-sale debt for household appliances in the United 
States, 1929-1942.* 

The work sheet is not reproduced here, but one similar to 
Table 91 (page 594) was constructed and the following set of 
subtotals was obtained: $S_1 = 2{,}842$; $S_2 = 18{,}159$; 
$S_3 = 89{,}614$; and $S_4 = 362{,}901$ (in millions of dollars). 
Using the method of orthogonal polynomials, which is the quickest 
method of finding at once the first-, second-, and third-degree 
polynomial trend lines by one set of calculations, the following 
results were obtained by the application of Eqs. (5) and (8) to (10), 
Chap. XXII (pages 605-612): 

$$
\begin{aligned}
\alpha &= \frac{2{,}842}{13} = 218.61538 \\
\beta  &= \frac{18{,}159}{91} = 199.54945 \\
\gamma &= \frac{89{,}614}{455} = 196.95384 \\
\delta &= \frac{362{,}901}{1{,}820} = 199.39615
\end{aligned}
$$

The values of the denominators in the above fractions were 
obtained from Table 93 (page 613). These calculations are 
carried to more places than are significant for the problem because 
they must be combined in multiple proportions to obtain the 
following results by using Eqs. (9) and (10), Chap. XXII 
(pages 611-612): 

* Holthausen, Duncan McG., “Monthly Estimates of Short-term 
Consumer Debt, 1929-1942,” Survey of Current Business, Vol. 22 (November, 
1942), pp. 9-25, 17. 
$$
\begin{aligned}
\alpha' &= 218.61538 & A &= 218.61538 \\
\beta'  &= 19.06593  & B &= 9.53296 \\
\gamma' &= 13.87471  & C &= 3.15334 \\
\delta' &= -6.12367  & D &= -0.64948
\end{aligned}
$$

Accordingly, the following set of orthogonal-polynomial 
trends is obtained; the first line is the straight-line trend; the 
first line combined with the second line is the second-degree 
polynomial trend; the first, second, and third lines combined 
give the third-degree polynomial trend: 

$$
\begin{aligned}
y' = 218.61538 &+ 9.53296\,p_1 \\
               &+ 3.15334\,p_2 \\
               &- 0.64948\,p_3
\end{aligned}
$$

that is to say, since $p_1 = t$, $p_2 = t^2 - 14$, and $p_3 = t^3 - 25t$ 
when $N = 13$ (see pages 605 and 615), 

$$
\begin{aligned}
y' = 218.61538 &+ 9.53296\,t \\
               &+ 3.15334\,(t^2 - 14) \\
               &- 0.64948\,(t^3 - 25t)
\end{aligned}
$$

The three possible trends are therefore the following (in 
millions of dollars): 

Straight-line trend: 

$$y' = 218.6 + 9.533t \quad \text{(origin at 1935)}$$

Second-degree polynomial trend: 

$$y'' = 174.5 + 9.533t + 3.153t^2 \quad \text{(origin at 1935)}$$

Third-degree polynomial trend: 

$$y''' = 174.5 + 25.77t + 3.153t^2 - 0.6495t^3 \quad \text{(origin at 1935)}$$


Table 99 is presented to show the raw annual data and the 
annual values of each of these three trends, which are also 
presented graphically in Fig. 150. 
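
The annual trend values of Table 99 can be regenerated from the orthogonal-polynomial coefficients. The sketch below is an illustration, not the authors' work sheet: the function names are hypothetical, the coefficients A, B, C, and D and the polynomials for N = 13 are those given above, and t is counted in years from 1935.

```python
# Coefficients of the orthogonal-polynomial trend (origin at 1935).
A, B, C, D = 218.61538, 9.53296, 3.15334, -0.64948

def p1(t):
    return t

def p2(t):
    return t ** 2 - 14          # second-degree polynomial for N = 13

def p3(t):
    return t ** 3 - 25 * t      # third-degree polynomial for N = 13

def trends(year, origin=1935):
    """Return (straight-line, second-degree, third-degree) trend values."""
    t = year - origin
    line = A + B * p1(t)
    second = line + C * p2(t)   # add the second-degree term to the line
    third = second + D * p3(t)  # add the third-degree term to that
    return line, second, third

trend_1929 = [round(v) for v in trends(1929)]
trend_1935 = [round(v) for v in trends(1935)]
trend_1941 = [round(v) for v in trends(1941)]
```

Because each higher-degree trend is the lower one plus a single added term, the three columns of Table 99 come from one pass over the years.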

An annual increment averaging something less than 10 on a 
base of over 200 is not too great to deter the assumption that the 
straight-line trend roughly depicts rational long-term growth. 
If it is appropriate to suppose that installment-sale debt would 
tend to grow at the rate of population growth, which is presumably 
geometric, the trend line fitted should be a logarithmic 

Table 99.—Annual Trend Analysis of Consumer Installment-sale 
Debt for Household Appliances in the United States, 1929-1942 
(In millions of dollars) 

Year     Raw data    Straight-line    Second-degree    Third-degree
                     trend            polynomial       polynomial
                                      trend            trend

1929       244           161              231              274
1930       236           171              206              206
1931       200           180              187              163
1932       140           190              174              143
1933       114           200              168              141
1934       125           209              168              152
1935       151           219              174              174
1936       209           228              187              203
1937       292           238              206              233
1938       277           247              232              263
1939       259           257              263              286
1940       278           266              301              301
1941       317           276              345              302
1942        *            286†             396†             287†

* Data available for only the first 9 months of the year; this year was not used in fitting 
the trends. 

† Extrapolated. 



Fig. 150.—Annual trend analysis of consumer installment-sale debt for household 
appliances in the United States, 1929-1942. 
trend; but for a short period of years the straight-line trend is an 
adequate approximation of the more appropriate logarithmic 
trend. The straight-line trend in the data presently studied may 
consequently be used as the base from which the major cycle in 
the data can be measured. 

The second-degree polynomial trend shows a sharp rise for the 
later years, and if extrapolated beyond 1942 it would quickly 
approach infinity; it is not, therefore, a reasonable picture of 
rational growth. The straight line comes nearer to what would 
be the result if a logarithmic trend were fitted, if data covering 
a long enough period were available to afford sufficient perspective 
to obtain a growth curve. 


Table 100.—Cyclical Movements in Consumer Installment-sale 
Debt for Household Appliances in the United States, 1929-1942 

Year    Raw data,     Straight-line     Third-degree      Cycle,       Cycle mixed
        millions of   trend, millions   polynomial        per cent     with residuals,
        dollars       of dollars        trend, millions   y'''/y'      per cent
        y             y'                of dollars                     y/y'
                                        y'''                           (y' = 100)

1929      244            161               274              170           152
1930      236            171               206              120           138
1931      200            180               163               90           111
1932      140            190               143               75            74
1933      114            200               141               70            57
1934      125            209               152               73            60
1935      151            219               174               79            69
1936      209            228               203               89            92
1937      292            238               233               98           123
1938      277            247               263              106           112
1939      259            257               286              111           101
1940      278            266               301              113           104
1941      317            276               302              109           114


The third-degree polynomial trend seems adequately to 
represent the rounded contour of a major cycle. Examination 
of Fig. 150, accordingly, leads to the conclusion that the straight- 
line trend can be used to depict growth in the data, and the 
third-degree polynomial trend can be used to measure the major 
cycle. As a consequence, the raw data divided by the straight- 
line trend should give a picture of the major cyclical movement 
in this period (plus the residual fluctuations);¹ and the third-degree 
polynomial trend divided by the straight-line trend gives 
a measure of the major cycle. Table 100 and Fig. 151 give the 
results of such computations. The column headed Cycle in 
Table 100 consists of the third-degree polynomial empirical 
trend divided by the straight-line empirical trend, giving as a 
result a smoothed measure of the major cycle. This is shown 
by the heavy line in Fig. 151. The column headed Cycle mixed 
with residuals in Table 100 consists of the raw data divided by 
the straight-line empirical trend, giving as a result the measure 
of the cycle mixed with residual fluctuations in annual data.² 
Both these columns are expressed as percentages, with the y' 
for each year equal to 100. 

Fig. 151.—Cyclical study of consumer installment-sale debt for household 
appliances in the United States, 1929-1942. 
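
The two percentage columns of Table 100 can be recomputed from the raw data and the trend equations. The sketch below is illustrative: the helper names are hypothetical, the trend coefficients are those of the annual trend equations with origin at 1935, and the raw data are the annual figures of Table 100. Recomputed values may differ from the printed ones by a point, since the printed columns appear to rest on rounded intermediate figures.

```python
# Orthogonal-polynomial trend coefficients (origin at 1935, N = 13).
A, B, C, D = 218.61538, 9.53296, 3.15334, -0.64948

RAW = {1929: 244, 1930: 236, 1931: 200, 1932: 140, 1933: 114, 1934: 125,
       1935: 151, 1936: 209, 1937: 292, 1938: 277, 1939: 259, 1940: 278,
       1941: 317}

# Printed columns of Table 100: (cycle, cycle mixed with residuals).
PRINTED = {1929: (170, 152), 1930: (120, 138), 1931: (90, 111),
           1932: (75, 74), 1933: (70, 57), 1934: (73, 60), 1935: (79, 69),
           1936: (89, 92), 1937: (98, 123), 1938: (106, 112),
           1939: (111, 101), 1940: (113, 104), 1941: (109, 114)}

def straight_line(t):
    return A + B * t

def third_degree(t):
    return A + B * t + C * (t ** 2 - 14) + D * (t ** 3 - 25 * t)

def cycle_columns(year, origin=1935):
    """Return (cycle, cycle mixed with residuals) in per cent, y' = 100."""
    t = year - origin
    base = straight_line(t)
    cycle = 100 * third_degree(t) / base      # smoothed major cycle
    mixed = 100 * RAW[year] / base            # cycle plus residuals
    return cycle, mixed

computed = {year: cycle_columns(year) for year in RAW}
```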

Determination of Cycle in Monthly Data. Cycle Determined 
by Adjusting Monthly Data. Monthly data, when examined to 
discover the cycle, must be adjusted not only for trend but also 
for seasonal variations. Adjusting monthly data for trend 
and seasonal variation in order to measure cyclical movements 

1 In addition, there might be minor cyclical movement, a fact that could 
be determined by further analysis of data extending over a longer period of 
time. 

2 See p. 570 for meaning of “residual fluctuations” in time series. In 
this instance, the residuals might include short-cycle fluctuations. See 
also Chap. XXV, pp. 659-661. 
Table 101.—Work Sheet for Calculating Monthly Index of Cycle 
Description of Data: Consumer installment-sale debt for purchase of 
household appliances, United States 

Source of Data: Survey of Current Business, Vol. 22 (1942), pp. 9-25, 17. 

(1)               (2)          (3)          (4)            (5)              (6)
Year and month    Raw data,    Monthly      Index of       Monthly trend    Cycle,
                  millions     trend,       seasonal       times index      per cent
                  of dollars   millions     variation,†    of S.V.,         y/(y' × S.V.)
                  y            of dollars   per cent       millions
                               y'*          Av. = 100      of dollars
                                                           y' × S.V.

1940:
  January           262         262.0         96.2          252.0           104.0
  February          255         262.8         94.0          247.0           103.2
  March             253         263.6         92.5          243.8           103.8
  April             259         264.4         95.2          251.7           102.9
  May               271         265.2         97.6          258.8           104.7
  June              281         266.0        103.2          274.5           102.4
  July              288         266.8        104.7          279.3           103.1
  August            294         267.6        107.6          287.9           102.1
  September         293         268.4        105.0          281.8           104.0
  October           290         269.1        102.2          275.0           105.4
  November          290         269.9        100.6          271.5           106.8
  December          302         270.7        101.1          273.7           110.3
1941:
  January           290         271.5                       261.2           111.0
  February          286         272.3                       256.0           111.7
  March             286         273.1                       252.6           113.2
  April             303         273.9                       260.8           116.2
  May               320         274.7                       268.1           119.4
  June              330         275.5                       284.3           116.1
  July              336         276.3                       289.3           116.1
  August            346         277.1                       298.2           116.0
  September         342         277.9                       291.8           117.2
  October           333         278.7                       284.8           116.9
  November          320         279.5                       281.2           113.8
  December          313         280.3                       283.4           110.4
1942:
  January           294         281.1                       270.4           108.7
  February          285         281.9                       265.0           107.5
  March             272         282.7                       261.5           104.0
  April             258         283.5                       269.9            95.6
  May               241         284.3                       277.5            86.8
  June              219         285.1                       294.2            74.4
  July              202         285.9                       299.3            67.5
  August            183         286.7                       308.5            59.3
  September         169         287.4                       301.8            56.0
  October
  November
  December

* Equation of monthly trend: y' = 219.0 + 0.796t (origin July, 1935). 
† Necessary to copy this for only one year; this seasonal variation was calculated for 
illustrative purposes in Chap. XXIII, pp. 625-631. 
involves the removal of trend and seasonal variation from the 
raw monthly data. The method is devised to produce an index 
or relative expression of raw data compared with what would be 
expected if only trend and seasonal variation were present in the 
data. Such analysis gives an answer to the question: How 
far are the raw data from month to month above or below what 
they would be if they were following the usual course of seasonal 
variation and the expected trend? In effect, the process is the 
reverse of that illustrated by a hypothetical series at the beginning 
of Chap. XX. The theory underlying the method of 
adjusting monthly data for seasonal variation and trend contains 
no tricks new to the student after mastering the material 
in the preceding chapters. It is quite simple in concept, though 
perhaps the arithmetical calculations involved are somewhat long. 
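
The adjustment described above amounts to dividing each raw monthly figure by the product of its trend value and its seasonal index. The sketch below illustrates the arithmetic for the first months of 1940; it is not the authors' work sheet. The function names are hypothetical, the monthly trend equation is the one given in the footnote to Table 101 (y' = 219.0 + 0.796t, origin July, 1935), and the seasonal index is that of Table 97. Intermediate values are not rounded here, whereas the work sheet rounds each column.

```python
# Index of seasonal variation from Table 97, January through December.
SEASONAL = [96.2, 94.0, 92.5, 95.2, 97.6, 103.2,
            104.7, 107.6, 105.0, 102.2, 100.6, 101.1]

def monthly_trend(t):
    """Monthly trend in millions of dollars; t in months from July, 1935."""
    return 219.0 + 0.796 * t

def cycle_per_cent(raw, t, month):
    """Raw datum as a percentage of trend times seasonal index.

    month -- calendar month, with 1 = January
    """
    expected = monthly_trend(t) * SEASONAL[month - 1] / 100
    return 100 * raw / expected

# Raw data for January, February, March, and December, 1940 (Table 101).
january = round(cycle_per_cent(262, 54, 1), 1)
february = round(cycle_per_cent(255, 55, 2), 1)
march = round(cycle_per_cent(253, 56, 3), 1)
december = round(cycle_per_cent(302, 65, 12), 1)
```

A result above 100 means the month stood above what trend and season alone would lead one to expect, and a result below 100 means it stood below.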

Illustration of Method of Determining Cycle in Monthly Data. 
It should be pointed out at the start that measuring the cycle 
in monthly data is measuring the same cycle as was measured 
in the annual data, if the same trend is used. In this illustration 
the raw monthly data will be adjusted for the straight-line trend 
and for seasonal variation, so that the resulting series of monthly 
data will correspond to the figures given in the last column of 
Table 100, except that they will be monthly instead of annual 
figures. The annual averages of the monthly adjusted data 
obtained in this illustration should be equal to the annual 
percentages shown in the last column of Table 100. 

Table 101 is a work sheet drawn up for the purpose of making 
the necessary calculations, on the assumption that (1) the index 
of seasonal variation and (2) the equation of trend on an annual 
basis have been calculated. The data used are monthly figures 
for consumer installment-sale debt for household appliances in the 
United States, 1929-1942. In the illustration only the period 
1940-1942 is presented. It would be a good laboratory exercise 
for the student to work out the rest for himself and plot the 
resulting adjusted figures. 

Column (2) of Table 101 contains the raw monthly data 
tabulated from the source.^ 

If a trend equation is in terms of annual figures, and the 
annual figures used are annual totals, the monthly trend equation 
will be 

1 Holthausen, op. cit. 
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I ^ \ ^ 4 

^ 12 144 ^ 


✓ 

But if the annual figures are annual averages of monthly data, 
then the monthly equation of trend will be 


$$y' = a + \frac{b}{12}\,t$$

Thus, b must be divided by 12 because in the monthly trend 
equation the annual increment is distributed among 12 parts 
(t now stands for months instead of years). In other words, if b 
is the annual increment, b/12 is the monthly increment. But if
b is the annual increment of total annual data (sum of the 12 
months each year), then to put it on a monthly basis it is neces¬ 
sary first to convert it to a monthly figure by dividing by 12; 
it is then still an annual increment and has to be divided by 12
again to obtain the monthly increment. 

In the trend equation, a can be assumed to be at June-July of 
the origin year, and of course the origin may be shifted by changing 
accordingly the value of a. The origin is at the middle of the 
year, i.e., between June and July. For example, the equation
of trend found for the annual data on consumer installment-sale 
debt for household appliances is (in millions of dollars) 

$$y' = 218.6 + 9.53t \qquad \text{(origin at 1935)}$$

The data used in this illustration are annual averages of monthly 
data; so the monthly trend equation is (in millions of dollars) 

$$y' = 218.6 + 0.796t \qquad \text{(origin at June-July, 1935)}$$
in which the unit of t is 1 month.

By adding algebraically half a monthly increment to 218.6, 
the origin is shifted to July, 1935, and the approximate equation 
of monthly trend is as follows (in millions of dollars): 

$$y' = 219.0 + 0.796t \qquad \text{(origin at July, 1935)}$$
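The origin shift and the month count can be checked with a short sketch. The function names are illustrative and do not come from the text; the coefficients 218.6, 219.0, and 0.796 are those quoted above, and the origin is assumed to be July, 1935.

```python
def months_since_origin(year, month, origin_year=1935, origin_month=7):
    """Number of months t between the origin (July, 1935) and a given month."""
    return (year - origin_year) * 12 + (month - origin_month)

def shift_origin_half_month(a, b_monthly):
    """Move the origin forward half a month: from June-July to mid-July."""
    return a + 0.5 * b_monthly

def monthly_trend(a, b_monthly, t):
    """Straight-line trend value t months after the origin."""
    return a + b_monthly * t

a_july = shift_origin_half_month(218.6, 0.796)   # 218.998, rounded to 219.0 in the text
t_jan_1940 = months_since_origin(1940, 1)        # 54
t_sep_1942 = months_since_origin(1942, 9)        # 86
print(round(a_july, 1), t_jan_1940, t_sep_1942)
print(round(monthly_trend(219.0, 0.796, t_jan_1940), 1))
```

The month counter reproduces t = 54 for January, 1940, and t = 86 for September, 1942, the limits used for the work sheet.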

Solving this equation for different values of t (from t = 54 at
January, 1940, to t = 86 at September, 1942) gives the various
monthly values of y' shown in column (3) of Table 101 under the
caption Monthly Trend. Column (4) shows the index of seasonal
variation, which in Chap. XXIII was calculated by the
12 months’ moving total method. Column (5) shows seasonal 
variation and trend combined by multiplication, and column (6) 
is obtained by dividing the monthly items in column (2), the 
raw data, by the monthly items in column (5). By this last 
operation, both trend and seasonal variation are removed from 
the raw data; the resulting index gives an idea of how high or low 
the raw data are in comparison to what they might be expected 
to be according to usual seasonal variation and trend. 
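The column arithmetic of the work sheet can be sketched as follows. This is only an illustration under stated assumptions: the raw figures and the seasonal index below are hypothetical (they are not the values of Table 101), the seasonal index is assumed to be expressed in per cent with 100 as normal, and the function name is invented for the example.

```python
def adjust_for_trend_and_seasonal(raw, trend, seasonal_index):
    """Return work-sheet rows holding columns (2) through (6)."""
    rows = []
    for y, tr, s in zip(raw, trend, seasonal_index):
        expected = tr * s / 100.0          # column (5): trend x seasonal index
        adjusted = 100.0 * y / expected    # column (6): raw data as per cent of (5)
        rows.append((y, tr, s, expected, adjusted))
    return rows

# Hypothetical figures for three consecutive months (t = 54, 55, 56).
trend = [219.0 + 0.796 * t for t in (54, 55, 56)]   # column (3)
raw = [270.0, 262.0, 268.0]                          # column (2), assumed
seasonal = [98.0, 97.0, 101.0]                       # column (4), assumed

for row in adjust_for_trend_and_seasonal(raw, trend, seasonal):
    print(" ".join(f"{value:8.1f}" for value in row))
```

A value above 100 in the last column means the raw figure is higher than seasonal variation and trend would lead one to expect; a value below 100 means it is lower.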

Data thus treated over a series of years disclose information 
about the time series that it is not possible to visualize from the 
raw figures. It makes possible the comparison between cyclical 
and minor cyclical fluctuations in time series otherwise con¬ 
cealed by disturbing elements of seasonal variation and trends. 
If this monthly analysis of the data on consumer installment-sale 
debt for household appliances in the United States were done 
for the entire period 1929-1942, the picture of monthly data 
would, of course, resemble the broken line of Fig. 151. The 
annual averages of the monthly data, which contain cyclical 
movements mixed with residual movements, would be equal to 
the figures shown in the final column of Table 100. In this con¬ 
nection it is to be noted that the annual averages of column (6) 
in Table 101 are equal to the corresponding annual figures in the 
last column of Table 100; for 1940 the annual average of the 
figures in column (6) of Table 101 is equal to 104, and for 1941 
the annual average of the figures in column (6) of Table 101 
is equal to 114. 

From the results of calculations in Table 101 it may be con¬ 
cluded that consumer installment-sale debt for household 
appliances reached the peak of a cycle in May, 1941, remaining 
10 to 17 per cent above normal throughout 1941. In 1942 a 
sharp decline materialized; in fact, this decline, on a monthly 
basis, was rapid after October, 1941. The raw data appear to 
indicate that the peak of the cycle occurred in August, but this is 
due to the effect of seasonal variation and trend. When sea¬ 
sonal variation and trend are taken into consideration, the 
cyclical peak is found to be in May, 1941. From July, 1940, to
August, 1940, the raw data show an increase, but a cyclical 
decline occurred in that period. Removal of seasonal variation 
and trend makes it apparent that the appearance of a rise from
July, 1940, to August, 1940, was due to seasonal influences and
trend. 

Adjustment of data by removing seasonal variation and trend
makes it possible to judge quickly whether or not consumer 
installment-sale debt for household appliances is rising (or falling) 
more rapidly than seasonal variation and trend would lead us to 
expect. The resulting figures are frequently described when
published by saying that the data are “adjusted for seasonal
variation and trend.” Sometimes, if trend is unimportant or
of dubious character, only seasonal variation is removed and the
data are described as “adjusted for seasonal variation.” Charts
of such data appear frequently in financial publications and in 
the financial sections of metropolitan newspapers. 

Measuring the Cycle Where Trend Is a Second- or Third-degree 
Polynomial. The rational growth of some data is better described 
by a second-degree polynomial, as discovered in Chap. XXI. 
When such is the case, it would be necessary to use a second- 
degree polynomial instead of a straight-line trend as illustrated 
in the preceding sections of this chapter. 

A third-degree polynomial is not likely ever to resemble a 
rational growth element in a time series, but it may resemble the 
conformation of the major cycle during a specified period covered 
by data that are being analyzed. If it is desired not only to 
remove growth trend but also to remove from the data the effects 
of the major cycle in order to observe residuals that might be 
significantly described as a short cycle, the method described in 
the preceding sections could be used with monthly data, applying
the same principles to the removal of a third-degree polynomial
trend combined with seasonal variation that were applied
to the removal of a straight-line trend combined with seasonal 
variation. 

The general form of the second-degree polynomial annual
trend is $y' = a + bt + ct^2$, where the unit of t is 1 year. The equation of the
monthly second-degree polynomial trend, where the annual
data are annual averages of monthly data, would be

$$y' = a + \frac{b}{12}\,t + \frac{c}{144}\,t^2$$

where the unit of t is 1 month.
The general form of the third-degree polynomial annual trend
is $y' = a + bt + ct^2 + dt^3$, where the unit of t is 1 year. The equation of
the monthly third-degree polynomial trend, where the annual
data are annual averages of monthly data, would be

$$y' = a + \frac{b}{12}\,t + \frac{c}{144}\,t^2 + \frac{d}{1728}\,t^3$$

in which the unit of t is 1 month.

If the data are annual totals, instead of annual averages, 
every item on the right side of the respective equations will be
divided by 12.* 
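The rule for putting a polynomial annual trend on a monthly basis follows from substituting (months / 12) for years, so that the coefficient of the k-th power of t is divided by 12 raised to the k-th power; for annual totals every term is divided by 12 once more. A minimal sketch, with hypothetical function names and illustrative coefficients, assuming the origin is left unchanged:

```python
def annual_to_monthly_coeffs(coeffs, annual_totals=False):
    """Rescale [a, b, c, ...] of y' = a + b*t + c*t**2 + ... from years to months.

    annual_totals=False: annual data are annual averages of monthly data.
    annual_totals=True:  annual data are sums of the 12 months, so every
                         term is divided by 12 in addition.
    """
    extra = 12.0 if annual_totals else 1.0
    return [coef / (12.0 ** k) / extra for k, coef in enumerate(coeffs)]

def poly(coeffs, t):
    """Evaluate a + b*t + c*t**2 + ... at t."""
    return sum(coef * t ** k for k, coef in enumerate(coeffs))

annual = [100.0, 12.0, 1.44]                 # illustrative a, b, c (not from the text)
monthly = annual_to_monthly_coeffs(annual)   # approximately [100.0, 1.0, 0.01]

# The monthly equation at 24 months agrees with the annual equation at 2 years.
print(poly(annual, 2), poly(monthly, 24))
```

For a straight line with annual totals the function gives a/12 and b/144, the same divisors described earlier for the first-degree case.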

Danger in Extrapolating Trends. In the illustration on page
641 it was noted that the second-degree polynomial trend, if 
extended beyond the year 1941, would quickly go up to infinity. 
Thus the extrapolation, or extension, of this trend beyond 1942 
very soon becomes an absurdity. This shows the need for 
caution in the projection, or extrapolation, of empirical trends.^ 
Their projection for short periods of time (how long depends 
upon the conditions of each particular case) is a valuable aid in
constructing barometric indexes. 

A troublesome unsolved problem in time-series analysis is to 
know when trend is changing and also, for that matter, when 
seasonal variation may be changing. Neither the statisticians 
nor the economists have solved this problem, but they realize 
that it is ever present in time-series analysis. It is desirable, 
therefore, to be cautious about extending empirical trends into 
the future and to reexamine monthly data for seasonal variation 
at frequent intervals. A method for detecting changing trends 
in seasonal variation was explained and illustrated in the pre¬ 
ceding chapter. 

Method of Ratios vs. Method of Differences. In general, the
method here presented for removing one or more types of varia¬ 
tion from time series has been the method of division, or ratios. 
In other words, the raw data are expressed as percentages of 
computed trend and seasonal variation. This is not the only
method of removing trend and seasonal variation from the 
monthly or annual raw data. Another type of approach is called 

* See pp. 644-645. 

^ For further discussion in connection with economic forecasting, see 
Chap. XXV, pp. 661-671. 
the “method of differences,” which, in one of a number of forms,
may be summarized as follows: 

1. Assuming that the index of seasonal variation and the
monthly trend have been calculated by one of the conventional 
procedures, the monthly trend values are multiplied by the 
index of seasonal variation. 

2. The trend values multiplied by seasonal variation ($y'_i$) are now
subtracted from the raw data ($y_i$).

3. This gives a series $y_1 - y'_1$, $y_2 - y'_2$, . . . , being the
arithmetical amount, in original units (pounds, dollars, etc.), 
by which the raw data are greater each month, or less, than the 
computed value for trend multiplied by the index of seasonal 
variation. These residuals of the raw data from trend and 
seasonal variation form a series $r_1$, $r_2$, $r_3$, . . . , that would be
a time series fluctuating arithmetically above and below zero, 
according to whether the raw data were above or below trend 
multiplied by seasonal variation, that is to say, according to 
whether the raw data were greater or less than values expected 
in view of the anticipated growth and seasonal fluctuation. 

4. These residuals are in terms of the quantity units of the 
raw data. Consequently, it would be very difficult to compare 
the residual fluctuations of a series measured in bushels (say 
wheat production) with the residuals in dollars (say the price of 
wheat). It is necessary to find a common denominator in order 
to compare the residuals in various time series, obtained by the 
arithmetical difference method. 

5. The common denominator used is the standard deviation
of the residuals. $\sigma_r$ will be simply $\sqrt{\Sigma r^2/N}$, since their arithmetical
average is zero. Each r divided successively by $\sigma_r$ would
give a series in terms of standard-deviation units that could 
thereafter be compared with other time series similarly treated. 
Various series, whether the original units were dollars, pounds, 
inches, etc., will now be reduced to terms of standard-deviation 
units and can all be plotted on the same scale, namely, a scale 
that is calibrated in standard deviations. 
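The five steps above can be sketched briefly. The function name and the sample figures are hypothetical; the standard deviation is computed as the root mean square of the residuals, which follows the text in treating their arithmetical average as zero (with actual data the average is only approximately zero).

```python
import math

def standardized_residuals(raw, expected):
    """Method of differences: residuals from trend x seasonal, in sigma units.

    raw      : observed monthly figures (original units)
    expected : trend multiplied by the seasonal index, same units
    """
    residuals = [y - e for y, e in zip(raw, expected)]                     # steps 2-3
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))     # step 5
    return [r / sigma for r in residuals], sigma

# Hypothetical figures in arbitrary units.
raw = [12.0, 8.0, 11.0, 9.0]
expected = [10.0, 10.0, 10.0, 10.0]
z, sigma = standardized_residuals(raw, expected)
print(round(sigma, 3), [round(v, 2) for v in z])
```

Because the result is in standard-deviation units, two series originally measured in different units (bushels and dollars, say) can be plotted on the same scale.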

One important disadvantage in the method of differences 
persists even after the residuals are expressed in terms of their 
own standard deviations. The residuals will tend to be arithmetically
greater when trend is at high values and arithmetically
small when trend is at low values. This means that the impression
will almost universally arise from such an analysis that as
things grow they become subject to more violent fluctuations. 
In fact, when considered relative to the magnitudes from which 
variation occurs, these greater arithmetical variations may be 
really less important. The ratio method places at all times the 
proper proportional emphasis upon arithmetical fluctuations by
expressing them as a ratio to the trend and seasonal variation. 

On the other hand, by the same token the ratio method may 
tend to minimize the importance of fluctuations; for it may be 
possible that the proportional amount of change is not so sig¬ 
nificant as the actual amount of change. For example, the fact 
that the amount of unemployment is no greater proportionally 
may not necessarily dispose of the fact that the actual amount of 
unemployment at some particular time is very great and the 
corresponding personal problems distressing in the extreme. 

Whatever method of statistics is used, it is necessary for the 
analyst to keep his eyes open to the effect the method itself may 
have upon his results. 



PART VI 
Forecasting 

CHAPTER XXV 

THE ART OF FORECASTING WITH STATISTICS 

INTRODUCTION 

Prevalence of Forecasts. Ancient Origin of Pseudoscientific 
Forecasts. The human desire to look into the future led, even 
in ancient times, to the rise of various forms of pseudoscientific 
forecasts. Oracles were frequently consulted as to the outcome 
of a contemplated military campaign, business venture, or love 
affair. Among the most famous of these was the Delphian 
oracle. Astrologists were, and still are, consulted for what the
stars have to say; one of their most prominent devotees in 
modern times is said to be Adolf Hitler. 

It was partly to disprove some of these astrological notions 
that statistical method was first undertaken on a scientific 
basis. In the seventeenth century an idea prevailed that the 
phases of the moon influenced health; also, health was supposed 
to be critical every seventh year and life particularly hazardous 
at the ages of forty-nine and sixty-three. Near the end of the 
seventeenth century studies of vital statistics by Capt. John 
Graunt of London and Casper Neumann of Germany disproved 
the connection between health and the phases of the moon as
well as the fateful significance of every seventh year in life. 
Other similar superstitions were debunked by statistical
studies. From the beginning of the history of the modern 
money market, attempts have been made to devise some way 
to forecast the course of financial affairs. For the Antwerp 
Bourse in 1543, Christopher Kurz is said to have contrived an 
astronomical method of making prophecies about the money 
market.¹

1 Ehrenberg, E., Capital and Finance in the Age of the Renaissance,
p. 240. 
Modern Scientific Forecasts. Although forecasting was thus
once the special prerogative of soothsayers, today it has been 
placed upon a broader basis by the development of science. 
For one of the objectives of science is, precisely, to forecast. 
Science seeks to classify and determine relationships that may be 
used for purposes of prediction. Every scientific law is, in a cer¬ 
tain sense, a forecast. It foretells what will happen under certain 
circumstances. The law of gravitation says, for example, that 
if a ball is dropped from a tall building it will fall with an acceleration
of 32 feet per second per second. Boyle's law says that
the pressure in a given container varies, and will vary, directly
with the temperature and inversely with the volume. Scientific
astronomy makes it possible to forecast the tides, to construct the 
calendar for our mundane affairs, and, in addition, to forecast 
celestial events such as the date of the next visit of Halley's 
comet. There are no “ifs” or “buts” about the modern scientific
forecasts in the realm of the natural or physical sciences. 

Popular Dramatization of Forecasts. The depression of the
1930's did more than hundreds of books could have done to make 
people cycle-conscious. So general was the interest in cyclical 
behavior that by 1940 the Foundation for the Study of Cycles 
was set up as a nonprofit organization with an international 
committee composed of scientists and businessmen. This 
foundation proposed to help in the task of integrating the work 
of the thousands of scientists and statisticians who are contributing
in various fields to the study of cycles. Not only have
cycles been found to exist in the realm of business activity, but 
scientists in many other fields believe they have discovered 
cyclical behavior in their respective studies. For example, 
psychologists have discovered that human beings have regular 
ups and downs in their emotional life, following a cyclical 
pattern. Biologists have discovered what appear to be regular 
fluctuations in animal, insect, bird, and even fish populations. 
In 1937, Prof. William Hamilton of Cornell University, upon the 
basis of cycle studies, warned farmers and housewives of New 
York State to prepare for a scourge of mice in the winter of 
1939-1940 and for another outbreak in 1943-1944. While it 
may still be too early to put the stamp of final scientific approval 
upon all these cyclical discoveries, they are nevertheless making 
important contributions to knowledge. 
Some of the twentieth-century discoveries sound almost like
the pseudoscientific superstitions of the Middle Ages. Thus in 
1943 the public was advised “to look for skunks under your
front porch about 1945.” It was claimed that an answer could
be given to such questions as: Will you feel happy or gloomy a 
month from today? Such statements were made as: If you 
are born in January, February, March, or April, the chances 
are you will live longer than people born in July, August, or 
September.^ These notions about forecasting suggest a precision 
in statistical forecasts that they probably will never possess.^ 

Conditional Scientific Forecasts. A forecast, to be scientific, 
does not have to be unconditional; in fact, most forecasts in the 
realm of the social sciences and some in the realm of the physical 
sciences are hypothetical in character. Indeed, in its largest 
sense, forecasting must be taken to mean prediction of not only 
what will happen but what would happen under given hypo¬ 
thetical conditions. Not only must the predictions of the 
meteorologist and stockbroker be considered forecasts, but also 
predictions of the engineer as to the outcome of certain plans 
and the warnings of the economist as to the effect of certain 
proposed actions of Congress are forecasts. The latter are 
conditional forecasts. 

Many predictions of coming events are hedged in by all sorts 
of weasel-like conditions. It may be said that private enterprise 
will disappear if Republicans are not elected. Or an economist
may predict that a Congressional increase in tariff rates will 
cause exports to decline, provided that foreign countries do not 
offset our higher tariff by giving bounties to their exporters, or 
that foreign demand for American products does not increase 
for some unforeseen reason, or that American exports do not
become less costly to produce. Such forecasts are conditional, or 
hypothetical, in character. 

The practical worth of a forecast depends, not on whether 
it is conditional or unconditional, but on how much knowledge 
the forecaster actually has of the relevant conditions. An uncon¬ 
ditional forecast may be merely a wild guess and have little 

^ Dewey, Edward Russell, “Science Predicts the Future,” American
Magazine, Vol. 136 (1943), pp. 90-92.

* See More Exact Forecasting and Less Exact Forecasting, pp. 659-661. 
See also Smith and Duncan, Sampling Statistics and Applications,
value, in spite of its uncompromising and categorical appearance,
while a carefully drawn conditional forecast may be of great value,
in spite of its “pussyfooting” aspect. In the case of the latter, it
may be that the likelihood of the conditional factors is very 
slight and that they are mentioned only to guard the forecaster 
from unwarranted criticism. On the other hand, if the disturb¬ 
ing factor has a fair likelihood of occurrence, the nature of its 
effect might be forecast so that the recipient of the forecast 
could be on his guard against this factor; by watching it, he 
might know when to abandon his faith in the original forecast. 
For example, if a prediction of rain tomorrow is based merely 
on the fact that it looks somewhat cloudy today, the forecast 
would probably be of little value (in the sense that such forecasts 
would probably be wrong more often than they were right). 
In contrast, if a trained observer predicted rain after a thorough 
observation of the weather situation, this would have consider¬ 
able value even if he hedged his prediction by saying that the 
rain might not occur if the wind in a neighboring area shifted 
before a certain time. 

Qualitative vs. Quantitative Forecasts. Most forecasts are 
qualitative in character. The meteorologist says it will rain 
but does not always say how heavily. The economist may 
predict that the effect of an increase in the tariff will be to raise 
prices, but he does not often say to what degree. The meteorologist,
on the contrary, may give the approximate time when rain
is expected and how many inches are expected to fall; and the
economist may try to estimate the average foreseen rise in prices. 
The latter would be quantitative forecasts. 

It will be noted that forecasts may be quantitative in two 
ways, with reference to the degree of the predicted change and 
with reference to the time of occurrence. The success of fore¬ 
casting must be judged, not only on the basis of whether the 
forecast was correct, but also on how far the forecast went in 
actually describing the future event: its quantity and its timing.

Illustrations of Modern Forecasts. In the modern world, 
forecasts are applied in many fields. Predictions of astronomical 
events, as already indicated, have been among the earliest and 
most successful forecasts. The movements of the moon, the 
planets, and other heavenly bodies have been computed with 
considerable accuracy so that their future course may be predicted
with great precision. Forecasts of certain eclipses,
for example, have been only a few seconds in error in timing. 
In this connection, it is interesting to note that the theory of 
least squares was largely developed in the attempt to forecast 
the paths of the heavenly bodies. 

Closely akin to astronomical forecasts have been forecasts 
of weather conditions. Short-range forecasts are based mainly 
on wind conditions and barometric pressures, but long-range 
forecasts are sometimes attempted from the study of rainfall 
data, sunspots, and the like. In some instances, studies of 
growth rings in old trees have yielded weather data going back 
many years. These studies usually look for cyclical fluctuations 
that will indicate periods of high and low activity and permit 
long-range forecasting. Studies of average weather conditions 
and the dispersion around these averages also afford forecasts 
of the variability of conditions in different areas and hence 
suggest the more desirable airports and air routes. 

Engineers make many forecasts. A water-power engineer
will forecast the amount of power to be obtained from a dam of
given size built in a given river. Another engineer may predict
the breaking strength of given kinds of wire at different temperatures.
Still another may predict the maximum load to be
sustained by a given bridge.

From the laws of Mendel, biologists make predictions of 
results to be expected from crossbreeding. Agronomists will 
predict the average results to be obtained from the use of certain 
fertilizers, or certain methods of cultivation, or certain varieties 
of crops. Agricultural economists attempt to predict the 
effect of certain sized crops on the future prices of important 
commodities or the effect of certain prices on the future volume 
of production. 

Business economists attempt many kinds of forecasts. From 
studies of factors closely related to the sale of a certain product 
in a given area where the trade has been well established, fore¬ 
casts may be made of the sales to be obtained from new untapped 
areas of similar character. Other economic forecasts aim to 
predict the ups and downs of the business cycle in various lines 
of activity. Probably the greatest percentage of economic forecasts
are devoted to predictions of the stock market: money
rates, bond prices, and security prices.
These are but a few of the examples of forecasting. It is
probably true that forecasting in all its ramifications is pandemic 
in modern life. 

Use of Statistics in Forecasting. This chapter attempts to 
outline the use that may be made of statistical analysis in making 
forecasts. Details of the methods of forecasting are beyond the 
scope of this volume, which is not a book on forecasting but 
merely includes a chapter on the pattern of methods used in fore¬ 
casting. A few examples will be given to illustrate these 
methods. The aim here is primarily to indicate the application 
of different statistical techniques to the problems of forecasting. 

Statistics affords a basis for forecasting in two principal 
ways. (1) By studying monovariate and multivariate frequency 
distributions, statistics are used to forecast average results and 
the type and degree of dispersion around these results. (2) By 
means of time-series analysis, statistics are used to predict the 
course of events in time. Each of these applications to fore¬ 
casting will be discussed in the ensuing pages. 

FORECASTS FROM DISTRIBUTION STUDIES 

Forecasts from Monovariate Distributions. If considerable 
data have been obtained, forecasts from monovariate distribu¬ 
tions may yield good estimates of the mean, standard deviation, 
coefficient of skewness, and kurtosis of the population from 
which the data were derived. If such is the case, these popula¬ 
tion estimates may be used to forecast the character of future 
samples drawn from the given population. 

Familiar matters relating to family care and health conven¬ 
tionally rely upon forecasting by use of monovariates. Suppose 
the frequency distribution, or monovariate, represents the 
weights of boys of specified age. The mean of that distribution 
is presumably normal weight for that age; the standard devia¬ 
tion and kurtosis describe expected variability. From such a 
monovariate and its statistics, it is commonly inferred whether or 
not a child is under normal weight and, if so, whether or not this 
deficiency is sufficient to cause alarm. Taken with other 
evidence it may be the basis for the application of timely therapeutic
action.

In social control, monovariate distributions are used to 
standardize products involving the presumption of forecasting. 
The fat content and standard deviation in fat content of milk
that has been produced and sold in the past constitute a set of 
standards according to which it is ruled that milk sold in the 
future will conform; thus milk is graded according to standards 
found by frequency-distribution analysis. Methods of weight 
or content are used by the Bureau of Standards to set standards 
for many products, both in the raw state, such as grains of wheat, 
and in the final product, such as bread; and abnormal variations 
from these standards in the market are not permitted. 

In business the use of monovariate distributions for fore¬ 
casting is widespread. For example, the distribution in sizes 
of shoes sold by a retailer is used as the basis for forecasting his 
future sales and for determining his reorders of additional shoes. 
In such forecasting, the businessman is interested in forecasting 
each class in the distribution rather than in the distribution's 
average and standard deviation. A similar forecasting pro¬ 
cedure is used by any retailer when he purchases articles that 
are sold by size, which include most articles of clothing. The 
wholesaler and the manufacturer also are interested in the same 
type of forecasting, so that they may profit by having the 
appropriate number of articles of various sizes continually
ready for the consumer; if the articles are there, ready for him
to buy when he comes, a minimum of consumers' sales will be
lost.^

Forecasts from Bivariate Distributions. Bivariate data may 
likewise yield estimates of a bivariate population that may make 
it possible to forecast results of future samples. Suppose, for 
example, that it is found from a study of army records that there 
is a high correlation between the Army General Classification 
Test scores and the results obtained in a given electrical course.
To be specific, suppose that the bivariate distribution of these 
two variables appears to be normal in form and it is estimated 

^ For another illustration of the use of monovariates in forecasting, see 
Robert J. Myers, “Comparison of Demographic Rates Assumed by the
National Resources Committee with Actual Experience,” Journal of the
American Statistical Association, Vol. 38 (1943), pp. 201-209; also, for an
example of such forecasting for the purpose of control of quality of manu¬ 
factured product, see William B. Rice, “Quality Control Applies to Busi¬ 
ness Administration,” Journal of the American Statistical Association, 
Vol. 38 (1943), pp. 228-232; W. A. Shewhart, Economic Control of 
Quality of Manufactured Product (1931), 
that the line of regression of electrical grades on Army General
Classification Test scores is

$$Y' = 5 + 0.8X$$

with a first-order standard deviation of seven points. From 
this it is possible to make certain predictions regarding the future 
results of men taking the electrical course. Thus, it might be 
predicted that men who got 90 on the Army General Classi¬ 
fication Test will get an average grade of 77 on the electrical 
course, half of them getting less than this and half getting more. 
If 70 is taken as the passing grade, it may be predicted that 
something like 84 per cent of the men whose score is 90 will 
pass the course, 70 being one standard deviation less than the 
average and the distribution being normal.^ 
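The arithmetic of this prediction can be verified with a short sketch. The intercept 5 and slope 0.8 are the values consistent with the predicted average grade of 77 at a test score of 90; the function names are illustrative, and the normal distribution about the regression line is the assumption stated in the text.

```python
from statistics import NormalDist

def predicted_grade(score, intercept=5.0, slope=0.8):
    """Average electrical-course grade expected for a given test score."""
    return intercept + slope * score

def pass_probability(score, passing=70.0, sd=7.0):
    """Share of men with this score expected to reach the passing grade,
    assuming grades are normal about the regression line with the given sd."""
    grades = NormalDist(mu=predicted_grade(score), sigma=sd)
    return 1.0 - grades.cdf(passing)

print(predicted_grade(90))                 # 77.0
print(round(pass_probability(90), 2))      # 0.84
```

The passing grade of 70 lies exactly one standard deviation (7 points) below the predicted average of 77, which is why the proportion passing is about 84 per cent.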

Forecasts from Multivariate Distributions. Study of multi¬ 
variate distributions may permit the same sort of forecasting as 
the study of bivariates. Suppose that a study of army salvage 
data shows that the length of service of wool socks is closely 
related to the amount of marching required of the troops and the 
average temperature of the area. Suppose, for example, that 
the plane of regression derived from the data is as follows:^ 

Length of life = 40 days — 3 (average miles marched per day) 

+ 2 (average temperature) 

Then, if the average miles marched is increased by 2 miles per 
day, the Army may predict that the average length of life of 
wool socks will probably decrease by 6 days. Or again, if the 
troops are shipped to an area in which the average temperature 
is 10 degrees warmer, the average length of life of the wool socks 
will be increased by 20 days. Or, in a third instance, if it is
planned to send a force to a given area for which the average 
temperature is approximately 60 degrees and it is expected that 
the troops will march about 20 miles per day on the average,

^ For illustrations of the use of bivariates for prediction in business, see
Patricia Daly and Paul H. Douglas, “The Production Function for Canadian
Manufactures,” Journal of the American Statistical Association, Vol.
38 (1943), pp. 178-186; also see pp. 674-675. 

* The correlation between length of life and temperature is assumed to 
be positive on the ground that, the warmer the climate, the less the use 
that would be made of wool and the more the use that would be made of 
cotton socks. See the discussion on planes of regression, pp. 448-455.
then it should have sufficient supplies of wool socks on hand to
provide for new issues every 100 days on the average.^ If the 
study indicated that the first-order standard deviation about the 
plane of regression was about 5 days, the Army might keep on 
hand a large enough supply to replace socks every 85 days (that 
is to say, 100 minus three times the standard deviation, or 
100 - 15 = 85). 
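
The arithmetic of the sock illustration can be checked with a short sketch. The coefficients and the margin of three standard deviations are those given in the text; the function and variable names are hypothetical and are introduced only for illustration.

```python
def sock_life_days(miles_per_day, temperature):
    """Plane of regression from the text: 40 - 3*(miles) + 2*(temperature)."""
    return 40 - 3 * miles_per_day + 2 * temperature

# Forecast for troops marching 20 miles a day in a 60-degree area.
expected_life = sock_life_days(20, 60)

# Effect of changing one factor while the other is held fixed.
change_for_2_more_miles = sock_life_days(22, 60) - expected_life
change_for_10_more_degrees = sock_life_days(20, 70) - expected_life

# Conservative replacement interval: three standard deviations below the forecast.
standard_deviation = 5
replacement_interval = expected_life - 3 * standard_deviation

print(expected_life, change_for_2_more_miles,
      change_for_10_more_degrees, replacement_interval)
# prints: 100 -6 20 85
```

Each coefficient of the plane gives the change in the forecast for a unit change in its own factor, the other factor remaining unchanged.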

Errors of Forecasts. In concluding this section on the use
of monovariates, bivariates, and multivariates as forecasters, it 
must be noted that forecasts of the kind indicated are necessarily 
inexact. They are based on the assumption that the population 
is exactly known. When the population characteristics them¬ 
selves have to be estimated, as they usually do, then the fore¬ 
casts based upon these estimates will suffer from all the errors 
involved in the latter. The more refined analysis that is required 
to take care of these errors of estimate of the population is 
beyond the scope of this book. It is sufficient here to point out 
that the error of forecast based upon estimated population char¬ 
acteristics is greater than that based upon a known population. 
For example, if a plane of regression based upon sample estimates 
has a related standard deviation of five points, the probability of 
a forecast based on the plane being off by as much as two times 
five points in either direction (therefore, 2σ) will be, not the
normal 5 per cent, but perhaps 10 per cent or more. Every¬ 
thing depends on the size of the sample from which the original 
estimates of the population characteristics were made. 
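
The dependence of forecast error on sample size may be illustrated by a small sampling experiment. The sketch below is only an illustration under stated assumptions: it uses the simplest (monovariate) case, a normal population with a standard deviation of five points, and hypothetical sample sizes of 5 and 200; none of these figures comes from the text. It counts how often a new observation falls more than two estimated standard deviations from the estimated mean.

```python
import random
import statistics

def miss_rate(sample_size, trials=4000, seed=1):
    """Share of forecasts that miss by more than two *estimated* standard deviations."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Estimate the population characteristics from a sample.
        sample = [rng.gauss(0.0, 5.0) for _ in range(sample_size)]
        estimated_mean = statistics.fmean(sample)
        estimated_sd = statistics.stdev(sample)
        # Forecast the next value by the estimated mean and record a large miss.
        new_value = rng.gauss(0.0, 5.0)
        if abs(new_value - estimated_mean) > 2 * estimated_sd:
            misses += 1
    return misses / trials

small_sample_rate = miss_rate(5)      # estimates from only 5 observations
large_sample_rate = miss_rate(200)    # estimates from 200 observations
print(small_sample_rate, large_sample_rate)
```

With the large sample the proportion of misses stays near the normal 5 per cent; with the small sample it is markedly higher.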

FORECASTING TRENDS WITH TIME SERIES 

More Exact Forecasting. If much is known about a par¬ 
ticular time series, so that the nature of its growth and cyclical 
movements can be fairly well determined from rational con¬ 
siderations and if the remaining fluctuations are apparently 
random, forecasting from such a series can be put on much the 
same level as forecasting from distributions of the monovariate, 
bivariate, and multivariate type discussed above. Careful 
estimates may be made of the growth, and these may be extra¬ 
polated for a short period of time into the future. Distribution 
analysis of the random fluctuations will determine the range of 

1 For the summer season this would be less than for the winter season,
but a stock level based on a 100-day turnover might be taken as normal.

fluctuations around the growth curve and will afford an estimate
for the error of a forecast based solely on the growth element in 
the series. 

Suppose, for example, that a logistic form of growth appears to 
be very logical for a certain type of data. If the data have 
reached a certain stage of development, the values of the next 
few periods may be forecast from an extrapolation of the logistic 
curve fitted to the past data. The amount of error in the forecast
resulting from the random fluctuations around the normal
growth may be estimated from the standard deviation of the 
residual fluctuations of the data from the fitted logistic curve. 
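
A minimal sketch of this procedure follows. It assumes that the upper limit of growth (here called `ceiling`) is known from rational considerations, so that the logistic can be fitted by ordinary least squares on a linearized form; this shortcut, the function names, and the ten observations are all hypothetical and are not taken from the text.

```python
import math

def logistic(t, ceiling, a, b):
    """Logistic growth curve: ceiling / (1 + exp(a - b*t))."""
    return ceiling / (1.0 + math.exp(a - b * t))

def fit_logistic(times, values, ceiling):
    """Least-squares fit of a and b on the linearized form
    ln(ceiling/y - 1) = a - b*t, with the ceiling assumed known."""
    z = [math.log(ceiling / y - 1.0) for y in values]
    n = len(times)
    t_mean = sum(times) / n
    z_mean = sum(z) / n
    slope = (sum((t - t_mean) * (zi - z_mean) for t, zi in zip(times, z))
             / sum((t - t_mean) ** 2 for t in times))
    return z_mean - slope * t_mean, -slope   # (a, b)

# Hypothetical observations for ten past periods (not from the text).
times = list(range(10))
values = [5.0, 8.1, 14.6, 22.6, 36.0, 49.6, 65.1, 76.3, 86.1, 91.5]
ceiling = 100.0

a, b = fit_logistic(times, values, ceiling)

# Standard deviation of the residual fluctuations about the fitted curve.
residuals = [y - logistic(t, ceiling, a, b) for t, y in zip(times, values)]
residual_sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Extrapolation one period beyond the data.
forecast = logistic(10, ceiling, a, b)
print(round(forecast, 1), round(residual_sd, 2))
```

The residual standard deviation measures only the random scatter about the curve; it makes no allowance for error in the fitted constants themselves.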

Illustrations of the type of time series that would permit 
fairly exact forecasting are afforded by Fig. 54 in Chap. V and 
Figs. 144 and 145 in Chap. XXI. 

Less Exact Forecasting. The real difficulty in most time- 
series analyses is to determine what is random and what is not 
random. Furthermore, it is often hard to work out any rational 
basis for specific forms of the trends and cycles.¹ In cases
where there is no particular trend indicated by the rationale of 
the situation, forecasts must be of a rough-and-ready sort, and 
little can be done to determine the error of forecast. 

Economic time series are generally of the sort that do not
permit more exact statistical forecasts.² For this reason statistical
analysis is usually only one of the elements entering into
the making of economic forecasts. In some cases it plays a more
important role than in others, but nearly always the forecaster
incorporates his statistical findings into a general appraisal of 
the situation. As indicated above, statistical analysis in these 
cases is itself largely intelligent guessing. The statistical part 
of an economic forecast is consequently merely the quantitative 
ingredient of the final forecast. 

Public utilities, especially the telephone companies, are 
keenly interested in the subject of forecasting growth or trend 
elements in time series. In the telephone business the laying 

1 See discussion on rational vs. empirical trends, pp. 550-566.

2 Tintner, Gerhard, “The Analysis of Economic Time Series,” Journal
of the American Statistical Association, Vol. 36 (1940), pp. 93-101. Wallis,
W. Allen, and Geoffrey H. Moore, “A Significance Test for Time Series
Analysis,” Journal of the American Statistical Association, Vol. 36 (1941),
pp. 401-409; Vol. 38 (1943), pp. 153-164.

out of plans and the construction of new exchanges necessitate
some sort of forecast as to the future growth of the community. 
For years these companies have maintained elaborate and 
efficient research organizations whose business it is to forecast 
trends in growth of population as well as the geographical 
distribution of various types of business and residential areas 
in the communities served. 

Most business enterprises, however, are more concerned 
about cyclical fluctuations than about trend or growth in time 
series. For this reason the greatest number of published fore¬ 
casts have to do with the prediction of cyclical movements in 
business conditions. 

FORECASTING CYCLES WITH TIME SERIES 

All that has been said about the inexactness of forecasting 
trends by the use of time series applies equally to the forecasting 
of cycles with time series. Nevertheless, the practice of relying 
upon statistics as an aid to business is now so prevalent that 
statistics, along with accounting, has become one of the standard 
tools and one of the essential means of internal control of nearly 
all economic enterprises, as well as a guide to public policies of 
governmental agencies. Among its many commercial uses, 
business forecasting is one of the most important, and it is along 
this line that statistical analysis has been intensively developed 
in recent years. Today there are several important agencies 
that supply forecasting services. Among these are Standard &
Poor's Corporation, Brookmire Economic Service, Moody's
Investor's Service, Babson, and the Harvard Economic Society. 
In addition, many commercial banks such as National City, 
Cleveland Trust, and Chase National include forecasts of prob¬ 
able future business trends in their monthly letters. Supple¬ 
menting these professional services are the statisticians and 
statistical departments of many large corporations, such as the 
American Telephone and Telegraph Company, which make fore¬ 
casts for their own use. 

American activity in this field has been internationally con¬ 
tagious. As early as 1921 the publication of the Economic 
Bulletin of the Conjuncture Institute was begun in Moscow; this 
publication was devoted to the study of business cycles and
to the analysis of Russian statistics. Subsequently in nearly
every important European country during the 1920's and
1930's forecasting services were organized, sometimes by the
large universities. The League of Nations showed its intention
of inaugurating forecasting on a world scale by appointing a
Committee of Experts on Economic Barometers.¹

The many possible occasions when forecasting is required in 
modern business can be shown by a few examples. A com¬ 
mercial banker granting a loan must forecast the probability 
of its being repaid; his judgment in this respect will depend on 
his forecast of the borrower’s future earning power; this, in turn, 
depends on his estimation of probable future price stability 
in the borrower’s business. Similarly, a collateral loan will 
involve a prediction, more or less precise, of the future value 
of the security offered as collateral. A manufacturer needs to 
forecast probable sales and probable prices of his own goods 
and of materials he has to purchase, so that he can profitably 
plan production and plant expansion. A public-utility operator 
needs to forecast population and industrial trends, construction 
and operating costs, and probable prices for the service, in 
order to determine when and where to build a railroad line, a gas 
main, a power plant, or a telephone exchange. 

All these things are commonplace in economic life, but the 
growing complexity and interdependence of economic society 
have made it increasingly difficult for the average businessman 
to comprehend an existing situation in trying to formulate his
programs for the future. He is not a statistical expert. His 
knowledge of methods of summarization and comparison goes 
usually little beyond a vague comprehension of averages. To 
aid him, it is the purpose of the various business forecasting 
agencies “to provide the basis for business, financial, and security
market policy. Regardless of the inevitable margin of error
in every forecast, business, financial, or security market policy
which is geared to only a fairly intelligent estimate of future
probabilities is more likely to be sound than is policy geared
only to guess, or to no forecast whatsoever.”²

1 Cox, G. V., An Appraisal of American Business Forecasts, pp. 1-2.

2 “A Forecaster’s View of Forecasting,” Standard Statistics (June 15,
1931), p. 14. Also see Bratt, Elmer C., Business Cycles and Forecasting
(1941), pp. 736-800; Hardy, Charles O., and Garfield V. Cox, Forecasting
Business Conditions (1927).

Forecasting General Business Conditions. One of the most
important objects of economic forecasting is to predict general 
business conditions, that is to say, the cyclical position of general 
business. Good times and bad times are such important 
elements in determining the prosperity of individual lines of 
activity and of individual firms that the prospect for general 
business is probably the first thing any corporation executive 
wishes to know. Statistically, general business is properly 
measured by some index of all business activity. It is the sum¬ 
mation of the whole and not merely one of the parts, although 
an index of a part, say an index of industrial production, may be 
taken as a barometer of the upswings and downswings of the 
whole. Such series are commonly called “business barometers.”¹

Forecasts of general business conditions are based upon one 
of two forecasting methods or a combination of the two. The 
first method is known as “historical analogy,” the second as
“crosscut analysis.”²

The method of historical analogy is based on the assumption 
that in cyclical fluctuations history tends to repeat itself. In 
its cruder forms, this consists merely in forecasting the course of 
general business, subsequent to some disturbance, from the 
course of general business that followed a similar disturbance 
in the past. For example, the forecaster might undertake to 
predict the course of general business following the crisis of 1929 
from the course of business following the crisis of 1873. In
more carefully thought-out form, historical analogy becomes a 
business-cycle theory that attempts to explain how the interplay 
of economic forces causes general business now to rise and now 
to fall. 

Crosscut analysis proceeds on the basis that the business 
situation is never the same and that each new upswing or downswing
is the product of a set of factors different from those
previously operative. To understand the given situation all 
the factors must be weighed as to their importance and a net 
appraisal of the situation derived. 

1 See pp. 530-535.

2 For more elaborate classifications see Bratt, op. cit., pp. 736-760;
Haney, L. H., Business Forecasting (1931), p. 195; Day, E. E., “The Rôle of
Statistics in Business Forecasting,” Journal of the American Statistical
Association, Vol. 33 (1938), p. 2.

In good forecasting, both methods are employed. If a certain
cyclical theory appears to constitute a good explanation of past 
events, it is good forecasting practice to consider it in predicting 
future cycle changes. Nevertheless, careful study must be 
made to see whether the role played in the past by a particular 
industry or economic development is subsequently being played 
by some other industry or development. The user of historical 
analogy must always, therefore, be on guard for changes required 
in the statistical embodiment of the cyclical theory on which his 
analysis is based in order to keep it up to date in its assumptions. 
During the railroad era, statistics about railroads dominated 
the scene as good indicators of general business conditions; 
later, it was statistics about automobile production; perhaps 
the time will come when it will be aircraft production. Again, 
the present era is often regarded as the “iron age.” Statistics
of iron and steel production are often used as barometers of 
general business conditions because so many of the products 
of the modern age depend upon iron. Perhaps the time will 
come when the emphasis will shift, from the point of view of 
statistics, away from iron and steel production to the production 
of the lighter metals such as aluminum. Who can say when the 
world of business is changing from the one to the other? 

Reflection along the lines indicated in the preceding paragraph 
leads to the conclusion that continuous crosscut analysis is 
needed as a means of verifying and justifying the use of the 
historical-analogy method. 

Forecasting by Historical Analogy. One type of forecasting by
historical analogy makes extensive use of the fluctuations in 
particular time series that appear to lead general business con¬ 
ditions. Examples of series that have been used as business 
barometers are indexes of stock-market prices, changes in unfilled 
orders of the United States Steel Corporation, machine-tool 
orders, and the loan-deposit ratio. These series, it is argued, will 
tend to lead changes in general business conditions, and important 
changes in general business conditions will first be made apparent 
by them. For example, a clear and consistent downswing in 
unfilled steel orders is presumed to presage a similar downswing
in general business. Hence the latter is presumably forecast
from the former. In the case of the loan-deposit ratio, it is the
attainment of certain critical levels that is significant; when high

[From the Review of Economic Statistics, Vol. 23 (1941), p. 94.]

levels are reached, for example (i.e., when loans are high relative
to deposits), strained credit conditions are in evidence and a
crisis will be forecasted.

More elaborate analyses making use of the historical analogy 
for forecasting combine several economic series. A well-known 
example of such a combination is that prepared by the Harvard 
Economic Society and published in the Review of Economic 
Statistics. While the society itself makes no forecasts from its 
statistical series, they have been found useful for such purposes 
and it is generally understood that that is what they are published 
for. These are shown in Fig. 152. 

The Harvard series consist of three curves, known as the 
A, B, and C curves. The A curve represents speculation, the
B curve business, and the C curve money. The actual data
upon which these curves have been based vary from time to
time. In those shown in Fig. 152, the curves are constituted as
follows:¹ The A curve, speculation, is based on the price of all
securities listed on the New York Stock Exchange. The B
curve, business, is based on bank debits in 241 cities outside
of New York City. The C curve, money, is based on short-term
money rates. In each of the constituent series the trend and 
seasonal variation were removed (when it was deemed appropri¬ 
ate) before the final indexes were computed. 

The theory that underlies the use of the Harvard curves 
for forecasting is that changes in speculation will generally 
precede changes in general business and that the significance 
of these changes will be more clearly understood when the
course of the money curve is noted. A sharp rise in speculation 
at a time when money rates are low and still falling would appear 
to forecast better business conditions. On the other hand, a 
fall in speculation at a time when money rates are rising would 
appear to forecast a decline in general business. If coupled with 
a detailed crosscut analysis of the current business situation, 
these curves are found very helpful in predicting general business 
conditions. 

The Harvard curves are but one set of curves that have been 
employed in this attempt to forecast general business conditions. 
Various combinations of curves have been used. A number 

1 Frickey, Edwin, “Revision of the Index of General Business Conditions,”
Review of Economic Statistics, Vol. 14 (1932), pp. 80-87.




make use of capital issue by private corporations and capital 
expenditures of the various government bodies. The idea 
behind the use of investment curves is that the volume of income, 
and hence business, is largely determined by the volume of 
investment. 

As the result of a great amount of research work during the 
past twenty or twenty-five years, mostly under the auspices
of the National Bureau of Economic Research or the National 
Industrial Conference Board, but also by scholars in the United 
States Department of Commerce, increasing attention has been 
given to methods of measuring business conditions based upon 
quantity and distribution of national income. Instead of indexes 
of production, employment, volume of trade, and the like, these 
new indexes attempt to measure national income and its distribu¬ 
tion, consumer expenditures and producer expenditures, savings, 
capital formation, and the like. Figure 153 gives a picture of 
annual consumer spending, 1919-1942, showing indexes constructed
by Kuznets (National Bureau of Economic Research)
and by the United States Department of Commerce.¹ Figure
154 is another illustration of the use of national-income statistics 
and their derivatives to show the cycle in general business 
conditions. This figure reproduces an index of that part of the 
national income devoted to expenditures for new durable goods 
and indexes of gross capital formation, net capital formation, and 
offsets to savings. The United States Department of Commerce 
index of private gross capital expenditures is presumably equivalent
to Kuznets's gross capital formation; to these are added
indexes by Lauchlin Currie reputed to measure income-producing
Federal expenditures that offset savings and net government 
contribution to savings. The index of expenditures for new 
durable goods is constructed by the Board of Governors of the 
Federal Reserve System. 

Time and experience will reveal whether or not the national- 
income type of indexes proves to be better than the barometer 
or over-all measure of business activity types. The national- 
income type has been made possible by the increasing amount of 

1 Hoffenberg, Marvin, and Mabel S. Lewis, “Estimates of National
Output, Distributed Income, Consumer Spending, Saving, and Capital
Formation,” Review of Economic Statistics, Vol. 25 (May, 1943), pp. 107-174,
124.

Fig. 153. Consumer spending, 1919-1942. [From The Review of Economic Statistics, Vol. 25 (May, 1943), p. 124.]

Fig. 154. Capital formation, expenditure for new durable goods, and offsets to savings, 1919-1942. [From The
Review of Economic Statistics, Vol. 25 (May, 1943), p. 142.]

statistical data on income resulting from the administration of
the Federal personal and corporate income taxes. 

Perhaps the greatest difficulty with all forecasting series is 
that the amount of the lead or lag is likely to vary considerably 
from time to time, so that the timing of the forecasted change 
becomes difficult.¹ Another difficulty is to judge how great a
change in the forecasting series must be before it is considered 
significant. The curve is almost bound to show minor ups and 
downs that are little related to general business. Presumably a 
movement either up or down must be great and persistent before 
any significant change is forecasted, but how great and how 
persistent is the question. The answer to this question is always 
easy to read ex post facto, but in following the forecaster from 
month to month this is more difficult. If a lead is short and 
data are not reported quickly, a given forecasting series, con¬ 
sistent and reliable as it may be, is unlikely to have much fore¬ 
casting value since the change would be under way before it 
was manifested by the forecasting data. 

These difficulties apply in differing degrees to the various 
kinds of forecasters. In the case of the barometer type, which 
is ordinarily dependent upon one presumably indicating series 
such as unfilled orders of the United States Steel Corporation, 
the data are usually promptly available; but the minor ups and 
downs and the varying degree of lead and lag in the barometer as 
compared with general business constitute ever-present diffi¬ 
culties in their use. The indexes of general business activity 
based upon combinations of several series are less affected by 
difficulties with respect to lead and lag and minor fluctuations, 
but it is often difficult to find a combination of series that are 
promptly reported. The national-income type of indexes suffer 
particularly from the fact that the data are not available cur¬ 
rently, except for estimates that are being attempted, and these 
are dependent upon other types of data. 

A unique type of forecasting by historical analogy is employed 
by Roger Babson. The forecasting instrument is the Babson 
index of the physical volume of production. This covers manufacturing
production, mineral production, agricultural marketings,

1 But see p. 674 for application of leads and lags to the forecasting of
specific lines of business activity, in which it can be more successfully
applied.

building and construction, railway freight, electric power,
and foreign trade. The long-run trend of this curve is taken 
as normal, and the cyclical fluctuations are forecasted on the
mechanical principle that a given action has an equal and
opposite reaction.¹ Thus the area of a given period of prosperity
will indicate the area, but not the shape, of the coming depression.
The slope of the depression area is forecasted to some extent
with the help of other series and contemporary crosscut analysis
of individual lines of activity. 
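
The area principle can be put in a few lines of arithmetic. The sketch below is not Babson's own procedure; the index values, the flat trend, and the function names are hypothetical. It shows only that a given area of prosperity fixes the area of the following depression while leaving its depth and length undetermined.

```python
def area_above_trend(index_values, trend_values):
    """Sum of the positive deviations from trend, counting one unit of width per period."""
    return sum(max(value - trend, 0.0)
               for value, trend in zip(index_values, trend_values))

def depression_length(area, average_depth):
    """Number of periods a depression of the given average depth needs to offset the area."""
    return area / average_depth

# Hypothetical index of physical volume over six periods, with trend taken as 100.
index_values = [100, 104, 108, 110, 106, 102]
trend_values = [100] * 6

prosperity_area = area_above_trend(index_values, trend_values)

# The same area is consistent with a shallow, long depression or a deep, short one.
shallow_length = depression_length(prosperity_area, average_depth=5)
deep_length = depression_length(prosperity_area, average_depth=10)
print(prosperity_area, shallow_length, deep_length)
# prints: 30.0 6.0 3.0
```

The choice between the two shapes must come from other series and from crosscut analysis.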

Forecasting by Crosscut Analysis. Even if considerable
reliance is placed upon certain forecasting series based upon the 
historical-analogy principle, it would seem desirable to supple¬ 
ment the analysis by a more detailed study of the current 
situation. This will help to time the forecast better. It will 
also assure the forecaster that the forecasting series continue to 
hold their theoretical significance in the ebb and flow of business. 
The great danger is that the business-cycle theory on which the 
forecasting series are based may become outmoded or may be too 
simple to be fully satisfactory as new conditions unfold. Cross¬ 
cut analysis may possibly reveal these defects and help to 
remedy them. 

Some believe that business cycles are unique and that the 
roles played by various economic developments shift from cycle 
to cycle. If this were true, crosscut analysis would be the only 
logical method of forecasting. Some general theory would neces¬ 
sarily have to underlie the forecast, even if it were the negative 
theory that all cycles are unique. Nevertheless, it is necessary 
to examine all the important sectors of the economy, to weigh 
their relative importance in the given situation, and to determine 
the net outcome. This requires comprehensive surveys and 
shrewd judgment based on wide experience.

Such agencies as the Brookmire Economic Service, Standard & 
Poor’s Corporation, and Moody’s Investor’s Service generally 
follow the crosscut method. The Brookmire Economic Service 
watches carefully selected series, such as building construction, 
motorcar output and registration, exchange rates, and industrial 
employment. The importance attributed to the various series 
differs from time to time. Also, new ones are added and old ones 
discarded as the economic situation changes. In all cases where 

1 Seasonal variations in individual series are eliminated.

warranted, the Brookmire Economic Service distinguishes carefully
between basic trends, seasonal variation, and business
cycles in its appraisal of the business outlook. The Standard & 
Poor’s Corporation also watches many lines of activity and 
forecasts development in each line. The forecast of general 
business is mainly a summary of these many individual forecasts. 
Moody’s Investor’s Service likewise bases its general forecast 
upon many individual analyses. In making its forecasts, how¬ 
ever, Moody’s appears to be especially influenced by business¬ 
men’s anticipations of profits, a factor that receives much empha¬ 
sis in modern business-cycle theory. 

Forecasting Particular Lines of Activity. The same methods 
are used to forecast particular lines of activity as for general 
business conditions. Crude historical analogy, the use of 
leading series, and crosscut analysis all play their roles. 

Crude Historical Analogy. Figure 155 is an excellent example 
of the use of crude historical analogy in forecasting the course of 
agricultural prices and of wage rates during a long and extensive 
world war. Here the course of agricultural prices and wages in 
the First World War is taken as the pattern for the expected 
course of agricultural prices and wages in the Second World War. 
From the proximity of the two series to each other until the 
beginning of 1943 it would seem that the forecasting power of the 
former series is relatively high. This method is of greater value 
in forecasting particular lines of activity than it is when applied 
to general business conditions, although crosscut analysis might 
modify judgment of this forecast by pointing out that the efforts 
at price control and inflation control in the Second World War 
appear to be more courageous than they were in the First 
World War. 

Lead-Lag Relationships. Figure 156 illustrates the lead-lag 
relationship in forecasting hog production. In raising hogs the 
principal cost is the corn on which the hogs are fed. Furthermore,
the ratio between the amount of corn fed to a hog and his
weight is fairly constant. Hence, the profitability of hog
raising is essentially indicated by the so-called “corn-hog differential,”
which is the difference between the price of 100 pounds
of hogs and the cost of enough corn to raise 100 pounds of hogs.
As this differential increases, hog production becomes more
profitable; as it decreases, hog production becomes less profitable.
The increase or decrease in profitability affects hog production
with a lag of several months. Hence, changes in the corn-hog differential
can be used to forecast changes in hog production, as
shown in the figure. 
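
One way of measuring such a lead is to correlate the leading series with the following series at successive lags and to take the lag giving the highest coefficient. The sketch below assumes two hypothetical monthly series in which production simply repeats the differential four months later; the data and function names are invented for illustration, and actual series would show a much less regular relationship.

```python
import math

def correlation(xs, ys):
    """Pearson coefficient of correlation between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def best_lead(leader, follower, max_lag):
    """Try each lag from 0 to max_lag and return (lag, r) with the highest r."""
    candidates = []
    for lag in range(max_lag + 1):
        paired_leader = leader[:len(leader) - lag]   # earlier values of the leader
        paired_follower = follower[lag:]             # later values of the follower
        candidates.append((lag, correlation(paired_leader, paired_follower)))
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical series: production follows the differential with a 4-month lag.
differential = [math.sin(2 * math.pi * month / 16) for month in range(40)]
production = [100 + 20 * math.sin(2 * math.pi * (month - 4) / 16)
              for month in range(40)]

lag, r = best_lead(differential, production, max_lag=8)
print(lag, round(r, 3))
```

The maximum lag tried is kept shorter than the length of the cycle, since a periodic series will correlate equally well at a lag one full cycle longer.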



Fig. 155. Prices received by farmers and composite wage rates. Index numbers,
United States, 1914-1920 and 1939-1943. [From The Agricultural
Situation (May, 1943), p.S, published by the Bureau of Agricultural Economics,
United States Department of Agriculture.]


The Cycle Hypothesis. The lag of hog production behind the
corn-hog differential not only permits forecasting of the former
but also tends to cause periodic upswings and downswings in the
two series. The reason for this is as follows: Suppose that the
demand for pork increases and, owing to the inability to increase
rapidly the production of hogs, the corn-hog differential rises. 
This makes hogs more profitable to produce, and their number is 
gradually increased. The lag in response, however, may cause 
the differential to go higher than it would otherwise, and this in 
turn might stimulate a greater increase in production than is 
required to meet the new demand that caused the original rise 
in the ratio. When this enlarged production comes on the 
market, prices fall and the corn-hog differential drops. Owing to 



Fig. 156. Hog-corn price ratio and hog marketings, 1901-1942. Chart notes: average prices of packer and shipper purchases and of yellow corn; moving average of federally inspected hog slaughter. (From Bureau
of Agricultural Economics, United States Department of Agriculture.)


the greatly increased supply, prices go lower than their natural 
level and hog production becomes less profitable for a while.
The change in profitability causes hog production to drop off, 
and ultimately prices tend to rise again, completing the cycle. 
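
The mechanism just described can be imitated with a very simple model in which price depends on the current supply, and the next period's supply depends on the current price. The straight-line demand and supply relations and all the constants below are assumptions chosen for illustration; they are not estimates for hogs or for any other commodity.

```python
def simulate_lagged_supply(initial_supply, periods,
                           demand_intercept=20.0, demand_slope=1.0,
                           supply_intercept=2.0, supply_slope=0.8):
    """Price is set by current supply; next period's supply responds to current price."""
    supplies, prices = [], []
    supply = initial_supply
    for _ in range(periods):
        price = demand_intercept - demand_slope * supply
        supplies.append(supply)
        prices.append(price)
        # Producers respond with a one-period lag to the price they have just seen.
        supply = supply_intercept + supply_slope * price
    return supplies, prices

# With these constants the balancing price is 10; start with supply too small (8).
supplies, prices = simulate_lagged_supply(initial_supply=8.0, periods=8)
for period, price in enumerate(prices):
    print(period, round(price, 2))
```

The price swings alternately above and below its balancing level, each swing being the delayed response to the one before it. With these particular constants the swings die away; a stronger supply response would make them persist or grow.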

This existence of a periodic movement in the corn-hog dif¬ 
ferential and in hog production permits forecasting for some 
distance into the future. If a great war does not interrupt the 
normal course of economic forces, the high and low periods in 
the corn-hog differential can be predicted with a fair degree of
accuracy. Wise hog farmers gain considerably from this long- 
range forecasting. Similar periodic movements tend to appear 
in other lines of agriculture in which production lags behind
price stimuli. For example, the cattle cycle runs about fifteen
years, according to studies made by the United States Department
of Agriculture.

Crosscut Analysis. The application of crosscut analysis to
particular lines of activity is based in many instances on the 
analysis of supply and demand conditions. In agriculture, the 
carry-over and current crop prospects are important factors 
on the supply side. The economic condition of industries or of 
sections of the population using the given product, the prices of 
competing products, and the output of competing areas are 
important factors on the demand side. If the product has 
widespread uses, possibly prediction of changes in consumer
incomes or in general industrial activity might be the best way of 
forecasting the future demand for it. 

In manufacturing, principal attention is likely to be devoted 
to demand. When the demand is industrial, the forecasting 
takes primarily the form of predicting conditions in those lines 
of activity immediately supplied by the given line of manufac¬ 
turing. Thus steel production might be forecasted from railroad 
construction and maintenance, automobile production, road 
construction, and building activity. When the product is one 
sold to the consuming public and not to other industries, the 
analysis of demand becomes largely a study of the flow of income 
to consuming areas. This will be dependent on the prosperity 
of important industries in these areas and on the net flow of 
incomes from outside sources. The prices of competing products 
will also be an important demand factor. 

A statistical technique using multiple and partial correlation, 
mathematical and graphic methods, has been developed for 
making crosscut analyses such as those suggested in the two preceding 
paragraphs. This technique is widely used; in the case 
of many products the multiple- and partial-correlation technique 
makes it possible to derive demand curves that will forecast with 
considerable accuracy the amount of change in sales that would 
accompany a given contemplated change in price.¹ 
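
As a rough illustration of the multiple-correlation approach to a demand curve, the sketch below fits sales to price and consumer income by least squares and reads off the change in sales that would accompany a contemplated change in price. The observations, the linear form of the relation, and the names `solve_linear` and `fit_demand` are assumptions made for the example; they are not data or procedures taken from the text.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Bring the largest remaining entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_demand(prices, incomes, sales):
    """Least-squares fit of sales = a + b_price*price + b_income*income."""
    rows = [(1.0, p, y) for p, y in zip(prices, incomes)]
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * s for r, s in zip(rows, sales)) for i in range(3)]
    return solve_linear(xtx, xty)


# Hypothetical observations (not from the text).
prices = [10.0, 12.0, 11.0, 14.0, 13.0, 15.0]
incomes = [20.0, 21.0, 19.0, 22.0, 23.0, 20.5]
sales = [160.0, 157.0, 151.0, 154.0, 163.0, 142.5]

intercept, b_price, b_income = fit_demand(prices, incomes, sales)
# Net change in sales for a price increase of 2 units, income held constant.
change_in_sales = b_price * 2
```

The coefficient on price is the net relation of sales to price with income held constant, which is the quantity a partial-correlation analysis isolates.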

¹ Cf. Schultz, Henry, The Theory and Measurement of Demand (1938). 

FORECASTS WITH SEASONAL VARIATION 

Forecasting with seasonal variation is probably the oldest of 
all types of modern forecasting and is so general as to be commonplace. 
It is applied to particular lines of activity more specifically 
than to general business conditions. 

Historical Analogy. Use of historical analogy for forecasting 
with seasonal variation is simpler and more dependable than the 
use of historical analogy for cyclical or trend forecasts. The 
conditions underlying persistent seasonal variations are more 
readily analyzed than are the rational explanations of cycles and 
trends. Moreover, the forecasting is for a shorter period into 
the future and can therefore depend upon conditions remaining 
approximately unchanged pending the outcome of the forecasted 
events. Statistical techniques have been developed for measuring 
the dependability of a given seasonal variation.¹ 

¹ See pp. 631-636. 

Figure 157 illustrates the extrapolation of seasonal variation, 
which is the use of historical analogy for making a forecast with 
seasonal variation. From the figure it can be forecast, by 
assuming a continued agreement between 1942 and coming 1943 
seasonal movement in mortality from all causes, that the September 
death rate per 1,000, annual basis, will be about 7.25, 
the October and November rate about 8.25, and the December 
rate about 8.50. 

Fig. 157.—Mortality from all causes (death rates per 1,000, annual basis). 
Metropolitan Life Insurance Company industrial department, weekly 
premium-paying business. [From the Statistical Bulletin, Vol. 24 (July, 1943), 
p. 12, published by the Metropolitan Life Insurance Company.] 
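
The extrapolation by historical analogy described here can be sketched in a few lines. The version below assumes that the remainder of the current year will repeat last year's movement at the average level observed so far this year; the monthly figures and the name `extrapolate_by_analogy` are hypothetical and are not read from Figure 157.

```python
def extrapolate_by_analogy(previous_year, current_to_date):
    """Forecast the remaining periods of the current year from last year's pattern."""
    n = len(current_to_date)
    if n == 0 or n > len(previous_year):
        raise ValueError("need between 1 and len(previous_year) observed periods")
    # Average ratio of this year to last year over the periods observed so far.
    level = sum(c / p for c, p in zip(current_to_date, previous_year)) / n
    # Assume the rest of the year repeats last year's movement at that level.
    return [level * p for p in previous_year[n:]]


# Hypothetical monthly rates (not taken from the figure).
previous_year = [10.0, 9.5, 9.0, 8.5, 8.0, 7.5, 7.0, 7.0, 7.5, 8.0, 8.5, 9.0]
current_to_date = [11.0, 10.45, 9.9, 9.35]
forecast = extrapolate_by_analogy(previous_year, current_to_date)
```

A forecast of this kind is only as good as the assumption that conditions stay approximately unchanged for the rest of the year.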

Figure 158 is an application of the use of forecasting seasonal 
variation by historical analogy to the field of agricultural economics. 
Extrapolation of the 1943 curve predicts that income 
from farm marketings in the South Central region of the United 
States will fluctuate around 200 million dollars monthly until 
July or August; thereafter, monthly cash income from farm 
marketings in that region will rise sharply to a peak in October 
of perhaps 500 million dollars or higher, since the 1943 level 
appears to be a higher average than that of 1942. This figure 
shows the annual average seasonal movement, 1937-1941, which 
gives a somewhat more dependable seasonal indicator than a 
single year's figures. 

Fig. 158.—Cash income from farm marketings 1942-1943 compared with 
1937-1941 average. [From The Agricultural Situation, Vol. 27 (June, 1943), p. 8, 
published by the Bureau of Agricultural Economics, United States Department of 
Agriculture.] 
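
One simple way to obtain an average seasonal movement from several years is to average each period across the years and express the result as a percentage of the overall average. The sketch below does this for hypothetical quarterly figures; the data and the name `seasonal_index` are illustrative assumptions, and the more refined index methods treated elsewhere in the book are not reproduced.

```python
def seasonal_index(years):
    """Average seasonal index (base 100) from several years of period data."""
    periods = len(years[0])
    if any(len(y) != periods for y in years):
        raise ValueError("every year must have the same number of periods")
    # Mean of each period (month or quarter) across all years.
    period_means = [sum(y[i] for y in years) / len(years) for i in range(periods)]
    overall = sum(period_means) / periods
    return [100.0 * m / overall for m in period_means]


# Hypothetical quarterly income figures for two years (not from Figure 158).
years = [
    [80.0, 100.0, 120.0, 100.0],
    [90.0, 110.0, 130.0, 110.0],
]
index = seasonal_index(years)
```

Averaging over several years damps the accidents of any single year, which is why the averaged curve is the more dependable indicator.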

Combining Seasonal with Cyclical Forecasting. Whenever it 
is desired to make forecasts on the basis of a period shorter than 
a year, it is necessary to apply a seasonal forecast along with 
cyclical forecasting. In the case of conventional forecasting 
by the use of business-cycle studies and the resulting barometers, 
general business indexes, and crosscut analysis, discussed in a 
preceding section of this chapter, short-period forecasts based 
upon known seasonal variations are used as well as the cyclical 
forecasts. 

Many illustrations could be found of the application of this 
combination of seasonal with cyclical forecasting. Figure 159 
is an illustration in the field of agricultural economics. Based 
upon statistical forecasting of the cycles in production of livestock, 
similar to the cycle analysis of hog production already 
outlined, the levels of livestock marketings for 1943 and 1944 
are forecast. The annual amount is then distributed throughout 
the months of the year according to the predetermined index of 
seasonal variation. The figure presents the resulting forecast 
of monthly transportation loads for livestock, estimated from 
indicated marketings and shipments from public markets in the 
United States. On the same figure are shown the actual amounts 
monthly for the years 1941 and 1942, for purposes of comparison. 
Figures similar to this one for various lines of industrial and 
manufacturing activity appear frequently in such publications 
as the Survey of Current Business and in the publications of the 
various forecasting agencies. 

Fig. 159.—Transportation loads for livestock, estimated on basis of indicated 
marketings and shipments from public markets, United States, January, 1941-March, 1944. 
[From The Agricultural Situation, Vol. 27 (February, 1943), p. 8, 
published by the Bureau of Agricultural Economics, United States Department 
of Agriculture.] 
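
The step of distributing a forecast annual amount over the months according to a seasonal index amounts to a proportional allocation. A minimal sketch follows, assuming a monthly index on a base of 100; the index values, the annual total, and the name `distribute_annual` are hypothetical.

```python
def distribute_annual(annual_total, seasonal_index):
    """Spread an annual forecast over periods in proportion to a seasonal index."""
    total_index = sum(seasonal_index)
    return [annual_total * s / total_index for s in seasonal_index]


# Hypothetical monthly index (base 100) and annual forecast.
monthly_index = [90, 85, 95, 100, 100, 95, 90, 100, 110, 125, 115, 95]
annual_forecast = 2400.0
monthly_forecast = distribute_annual(annual_forecast, monthly_index)
```

Dividing by the sum of the index, rather than by a fixed 1,200, keeps the monthly figures adding to the annual total even when the index does not average exactly 100.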

THE QUALITY OF FORECASTING 

The success of forecasting is hard to judge. First it is to be 
noted that if an agency declines to make forecasts in difficult 
situations and makes rather limited forecasts in general, it is 
likely to have fewer failures than one that boldly undertakes to 
forecast on all occasions and in considerable detail. The success 
of a forecasting agency should be judged according to what it 
attempts to do. 

The success of forecasting should also be judged in the light 
of what might be accomplished by mere random guessing. In 
other words, an agency should be right at least 50 per cent of the 
time, or it is worse than useless. Judged on these bases, the 
various economic forecasting agencies have been fairly successful 
in forecasting general business conditions. Although not 
registering anything near a perfect score, they have at least 
been better than chance. 
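
Whether a record is better than chance can be given a rough numerical meaning. The sketch below assumes that each forecast is a simple two-way call (for example, up or down) that random guessing would get right half the time, and computes the probability of doing at least as well as a given record by guessing alone. The record of 14 correct calls in 20 and the name `prob_at_least` are hypothetical.

```python
from math import comb


def prob_at_least(hits, trials, p=0.5):
    """Probability of at least `hits` successes in `trials` independent guesses."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(hits, trials + 1)
    )


# Hypothetical record: 14 correct calls in 20 forecasts.
chance = prob_at_least(14, 20)
```

For this record the probability is a little under 6 per cent, so a guesser would seldom do as well. The comparison says nothing about how difficult or detailed the forecasts were.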

One of the chief problems of economic forecasting lies in 
the effect of the forecast itself. The effect of the forecast may 
conceivably be such, on the one hand, that the forecast actually 
causes the forecasted event to occur, or, on the other hand, that 
the forecast actually prevents the forecasted event from occurring. 
Whether or not such untoward results are produced 
depends largely on how widely the forecast circulates. On the 
one hand, suppose a forecasting agency predicts a general inflation 
of prices and enough people become convinced that the 
forecast is a true one; in such a case, the forecast may not only become 
true but be itself the cause of the thing that is forecasted. 
On the other hand, a subscriber to a forecast expects to profit from 
its use, in that his plans will anticipate probable future conditions 
of which a competitor is supposedly not so well informed. The 
fewer who have this information, the more likely it is that they 
will profit and that the forecast will be a true one. But the 
wider the acceptance of the forecast, the less chance the individual 
subscriber has to gain and the less likely is it that the 
forecast will prove to be true. Suppose, for example, that a 
forecasting agency advises its clients in a given productive 
activity that the price of its product is going to rise as a result of 
some foreseen increase in demand; if too many of the producers 
obtain the forecaster's service and follow its advice, overproduction 
will result and the price will decline rather than rise. This is 
an illustration of how a forecast might defeat itself. 

In the final analysis, it may be said that the greatest value of 
modern forecasting work lies in the large amount of statistical 
economic analysis that it promotes. Research into the business 
cycle and continued improvements in the statistical approach to 
social and economic problems cannot fail to reveal closer and 
closer approximations to the truth and thereby improve general 
knowledge about economic and social conditions. 



APPENDIX 

Table I.—Four-place Common Logarithms of Numbers¹ 




Tenths of the 
Tabular 
Difference 
1 2 3 4 5 


0170 0212 0253 0294 0334 0374 0414 

0569 0607 0645 0682 0719 0755 0792 

0934 0969 1004 1038 1072 1106 1139 

1271 1303 1336 1367 1399 1430 1461 

1584 1614 1644 1673 1703 1732 1761 

1875 1903 1931 1959 1987 2014 2041 

2148 2176 2201 2227 2253 2279 2304 

2405 2430 2455 2480 2504 2529 2553 

2648 2672 2696 2718 2742 2766 2788 

2878 2900 2923 2946 2967 2989 3010 

3096 3118 3139 3160 3181 3201 3222 2 4 6 8 11 

3304 3324 3346 3365 3385 3404 3424 2 4 6 8 10 

3502 3522 3541 3560 3579 3598 3617 2 4 6 8 10 

3692 3711 3729 3747 3766 3784 3802 2 4 5 7 9 

3874 3892 3909 3927 3945 3962 3979 2 4 5 7 9 

4048 4066 4082 4099 4116 4133 4150 2 3 5 7 9 

4216 4232 4249 4265 4281 4298 4314 2 3 5 7 8 

4378 4393 4409 4425 4440 4456 4472 2 3 5 6 8 

4533 4548 4564 4579 4594 4609 4624 2 3 5 6 8 

4683 4698 4713 4728 4742 4757 4771 1 3 4 6 7 

4829 4843 4857 4871 4886 4900 4914 1 3 4 6 7 

4969 4983 4997 5011 5024 5038 5051 1 3 4 6 7 

5106 5119 5132 5145 5159 5172 5185 1 3 4 5 7 

5237 5250 5263 5276 5289 5302 5315 1 3 4 5 6 

5366 5378 5391 5403 5416 5428 5441 1 3 4 5 6 

5490 5502 5514 5527 5539 5551 5563 1 2 4 5 6 

5611 5623 5635 5647 5658 5670 5682 1 2 4 5 6 

5729 5740 5752 5763 5775 5786 5798 1 2 3 5 6 

5843 5855 5866 5877 5888 5899 5911 1 2 3 5 6 

5955 5966 5977 5988 5999 6010 6021 1 2 3 4 6 

6064 6075 6085 6096 6107 6117 6128 1 2 3 4 5 

6170 6180 6191 6201 6212 6222 6232 1 2 3 4 5 

6274 6284 6294 6304 6314 6325 6335 1 2 3 4 5 

6376 6386 6395 6405 6415 6425 6435 1 2 3 4 5 

6474 6484 6493 6503 6513 6522 6532 1 2 3 4 5 

6571 6580 6590 6599 6609 6618 6628 1 2 3 4 5 

6665 6675 6684 6693 6702 6712 6721 1 2 3 4 5 

6758 6767 6776 6785 6794 6803 6812 1 2 3 4 5 

6848 6857 6866 6875 6884 6893 6902 1 2 3 4 4 

6937 6946 6955 6964 6972 6981 6990 1 2 3 4 4 


7024 7033 7042 7050 7059 7067 7076 1 2 3 3 4 

7110 7118 7126 7135 7143 7152 7160 1 2 3 3 4 

7193 7202 7210 7218 7226 7235 7243 1 2 2 3 4 

7276 7284 7292 7300 7308 7316 7324 1 2 2 3 4 

5.4 7324 7332 7340 7348 7356 7364 7372 7380 7388 7396 7404 1 2 2 3 4 

¹ Taken, with permission, from E. V. Huntington's Four Place Tables of Logarithms and 
Trigonometric Functions (Harvard Cooperative Society, Inc., 1907). 

Table I.—Four-place Common Logarithms of Numbers.—(Continued) 



0237 
0278 

0310 0314 0318 
0350 0354 0358 
0390 0394 0398 


0035 

0039 

0043 

0077 

0082 

0086 

0120 

0124 

0128 

0162 

0166 

0170 

0204 

0208 

0212 

0246 

0249 

0253 

0286 

0290 

0294 

0326 

0330 

0334 

0366 

0370 

0374 

0406 

0410 

0414 

0445 

0449 

0453 

0484 

0488 

0492 

0523 

0527 

0531 

0561 

0565 

0569 


0842 0846 0849 0853 


0824 

0828 

0860 

0864 

0896 

0899 


0966 

0969 

1000 

1004 

1035 

1038 

1069 

1072 

1103 

1106 

1136 

1139 

1169 

1173 

1202 

1206 

1235 

1239 

Table I.—Four-place Common Logarithms of Numbers.—(Continued) 



0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

1.50 

0.1761 

1764 

1767 

1770 

1772 

1775 

1778 

1781 

1784 

1787 

1790 

1.51 

1790 

1793 

1796 


1801 

1804 

1807 


1813 

1816 

1818 

1.52 

1818 

1821 

1824 


1830 

1833 

1836 

1838 

1841 

1844 

1847 

1.53 

1847 

1850 

1853 

1855 

1858 

1861 

1864 

1867 

1870 

1872 


1.54 

1875 

1878 

1881 

1884 

1886 

1889 

1892 

1895 

1898 

1901 

1903 

1.55 

1903 

1906 

1909 

1912 

1915 

1917 

1920 

1923 


1928 

1931 

1.56 

1931 

1934 

1937 

1940 

1942 

1945 

1948 

1951 

1953 

1956 

1959 

1.57 

1959 

1962 

1965 

1967 

1970 

1973 

1976 

1978 


1984 

1987 

1.58 

1987 

1989 

1992 

1995 

1998 

2000 

2003 

2006 

2009 

2011 

2014 

1.59 

2014 

2017 

2019 

2022 

2025 

2028 

2030 

2033 

2036 

2038 

2041 

1.60 

0.2041 

2044 

2047 

2049 

2052 

2055 

2057 

2060 

2063 

2066 

2068 

1.61 

2068 

2071 

2074 

2076 

2079 

2082 

2084 

2087 

2090 

2092 

2095 

1.62 

2095 

2098 

2101 

2103 

2106 

2109 

2111 

2114 

2117 

2119 

2122 

1.63 

2122 

2125 

2127 

2130 

2133 

2135 

2138 

2140 

2143 

2146 

2148 

1.64 

2148 

2151 

2154 

2156 

2159 

2162 

2164 

2167 

2170 

2172 

2175 

1.65 

2175 

2177 

2180 

2183 

2185 

2188 

2191 

2193 

2196 

2198 

2201 

1.66 

2201 

2204 

2206 

2209 

2212 

2214 

2217 

2219 

2222 

2225 

2227 

1.67 

2227 

2230 

2232 

2235 

2238 

2240 

2243 

2245 

2248 

2251 

2253 

1.68 

2253 

2256 

2258 

2261 

2263 

2266 

2269 

2271 

2274 

2276 

2279 

1.69 

2279 

2281 

2284 

2287 

2289 

2292 

2294 

2297 

2299 

2302 

2304 

1.70 

0.2304 

2307 

2310 

2312 

2315 

2317 

2320 

2322 

2325 

2327 

2330 

1.71 

2330 

2333 

2335 

2338 

2340 

2343 

2345 

2348 

2350 

2353 

2355 

1.72 

2355 

2358 

2360 

2363 

2365 

2368 

2370 

2373 

2375 

2378 

2380 

1.73 

2380 

2383 

2385 

2388 

2390 

2393 

2395 

2398 

2400 

2403 

2405 

1.74 

2405 

2408 

2410 

2413 

2415 

2418 

2420 

2423 

2425 

2428 

2430 

1.75 

2430 

2433 

2435 

2438 

2440 

2443 

2445 

2448 

2450 

2453 

2455 

1.76 

2455 

2458 

2460 

2463 

2465 1 

2467 

2470 

2472 

2475 

2477 

2480 

1.77 

2480 

2482 

2485 

2487 

2490 j 

2492 

2494 

2497 

2499 

2502 

2504 

1.78 

2504 

2507 

2509 

2512 

2514 

2516 

2519 

2521 

2524 

2526 

2529 

1.79 

2529 

2531 

2533 

2536 

2538 

2541 

2543 

2545 

2548 

2550 

2553 

1.80 

0.2553 

2555 

2558 

2560 

2562 

2565 

2567 

2570 

2572 

2574 

2577 

1.81 

2577 

2579 

2582 

2584 

2586 

2589 

2591 

2594 

2596 

2598 

2601 

1.82 

2601 

2603 

2605 

2608 

2610 

2613 

2615 

2617 

2620 

2622 

2625 

1.83 

2625 

2627 

2629 

2632 

2634 

2636 

2639 

2641 

2643 

2646 

2648 

1.84 

2648 

1 2651 

2653 

2655 

2658 

2660 

2662 

2665 

2667 

2669 

2672 

1.85 

2672 

2674 

2676 

2679 

2681 

2683 

2686 

2688 

2690 

2693 

2695 

1.86 

2695 

2697 

2700 

2702 

2704 

2707 

2709 

2711 

2714 

2716 

2718 

1.87 

2718 

2721 

2723 

2725 

2728 

2730 

2732 

2735 

2737 

2739 

2742 

1.88 

2742 

2744 

2746 

2749 

2751 

2753 

2755 

2758 

2760 

2762 

2765 

1.89 

2765 

2767 

2769 

2772 

2774 

2776 

2778 

2781 

2783 

2785 

2788 

1.90 

0.2788 

2790 

2792 

2794 

2797 

2799 

2801 

2804 

2806 

2808 

2810 

1.91 

2810 

2813 

2815 

2817 

2819 

2822 

2824 

2826 

2828 

2831 

2833 

1.92 

2833 

2835 

2838 

2840 

2842 

2844 

2847 

2849 

2851 

2853 

2856 

1.93 

2856 

2858 

2860 

2862 

2865 

2867 

2869 

2871 

2874 

2876 

2878 . 

1.94 

2878 

2880 

2882 

2885 

2887 

2889 

2891 

2894 

2896 

2898 

1 2900 

1.95 

2900 

2903 

2905 

2907 

2909 

2911 

2914 

2916 

2918 

2920 

2923 

1.96 

2923 

2925 

2927 

2929 

2931 

2934 

2936 

2938 

2940 

2942 

2945 

1.97 

2945 

2947 

2949 

2951 

2953 

2956 

2958 

2960 

2962 

2964 

2967 

1.98 

2967 

2969 

2971 

2973 

2975 

2978 

2980 

2982 

2984 

2986 

2989 


2989 

2991 

2993 

2995 

2997 

2999 

3002 

3004 

3006 

3008 

3010 
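
Entries of a four-place table such as Table I can be checked by direct computation, and the columns of tenths of the tabular difference correspond to linear interpolation between neighboring entries. The sketch below is offered only as a checking aid; the names `mantissa` and `interpolate` are illustrative and do not come from the source table.

```python
import math


def mantissa(x):
    """Four-place mantissa of log10(x) for 1 <= x < 10, as printed in such tables."""
    if not 1 <= x < 10:
        raise ValueError("x must lie in [1, 10)")
    return round(math.log10(x) * 10000)


def interpolate(x_low, x_high, x):
    """Estimate the mantissa of x linearly from two neighboring tabulated values."""
    m_low, m_high = mantissa(x_low), mantissa(x_high)
    fraction = (x - x_low) / (x_high - x_low)
    return round(m_low + fraction * (m_high - m_low))


# Tabulated entries for 5.40 and 5.41, and an interpolated value between them.
entry_540 = mantissa(5.40)
entry_541 = mantissa(5.41)
estimate_5405 = interpolate(5.40, 5.41, 5.405)
```

Here the tabular difference is 8, so five-tenths of it adds 4 to the lower entry.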

Table II.—Squares of Numbers¹ 


N 

0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

100 

10000 

10201 

10404 

10609 

10816 


11236 

11449 

11664 

11881 

110 

12100 

12321 

12544 

12769 

12996 

13225 

13456 

13689 

13924 

14161 

120 

14400 

14641 

14884 

15129 

15376 

15625 

15876 

16129 

16384 

16641 

130 

16900 

17161 

17424 

17689 

17956 

18225 

18496 

18769 

19044 

19321 

140 

19600 

19881 

20164 

20449 

20736 

21025 

21316 


21904 

22201 

150 

22500 

22801 

23104 

23409 

23716 

24025 

24336 

24649 

24964 

25281 

160 

25600 

25921 

26244 

26569 

26896 

27225 

27556 

27889 

28224 

28561 

170 

28900 

29241 

29584 

29929 

30276 


30976 

31329 

31684 

32041 

180 

32400 

32761 

33124 

33489 

33856 

34225 

34596 

34969 

35344 

35721 

190 

36100 

36481 

36864 

37249 

37636 

38025 

38416 

38809 

39204 

39601 


40000 

40401 

40804 

41209 

41616 


42436 

42849 

43264 

43681 

210 

44100 

44521 

44944 

45369 

45796 

46225 

46656 

47089 

47524 

47961 

220 

48400 

48841 

49284 

49729 

50176 

50625 


51529 

51984 

52441 

230 

52900 

53361 

53824 

54289 

54756 

55225 

55696 

56169 

56644 

57121 

240 

57600 

58081 

58564 

59049 

59536 


60516 


61504 

62001 

250 

62500 

63001 

63504 

64009 

64516 

65025 

65536 

66049 

66564 

67081 

260 

67600 

68121 

68644 

69169 

69696 

70225 

70756 

71289 

71824 

72361 

270 

72900 

73441 

73984 

74529 

75076 

75625 

76176 

76729 

77284 

77841 

280 

78400 

78961 

79524 

80089 

80656 

81225 

81796 

82369 

82944 

83521 

290 

84100 

84681 

85264 

85849 

86436 

87025 

87616 

88209 

88804 

89401 

300 


90601 

91204 

91809 

92416 

93025 

93636 

94249 

94864 

95481 

310 

96100 

96721 

97344 

97969 

98596 

99225 

99856 

100489 

101124 

101761 

320 


103041 


mmm 


105625 

106276 

106929 

107584 

108241 

330 

108900 

109561 

110224 

110889 

111556 

112225 

112896 

113569 

114244 

114921 

340 

115600 

116281 

116964 

117649 

118336 

119025 

119716 

120409 

121104 

121801 

350 

122500 

123201 


124609 

125316 

126025 

126736 

127449 

128164 

128881 

360 

129600 

130321 

131044 

131769 

132496 

133225 

mmm 

134689 

135424 

136161 

370 

136900 

137641 

138384 

139129 

139876 


141376 

142129 

142884 

143641 

380 

144400 

145161 

145924 

146689 

147456 

148225 

148996 

149769 

150544 

151321 

390 

152100 

152881 

153664 

154449 

155236 

156025 

156816 

157609 

158404 

159201 

400 


160801 


162409 

163216 

164025 

164836 

165649 

166464 

167281 

410 

168100 

168921 

169744 

170569 

171396 

172225 

mvkim 

173889 

174724 

175561 

420 

176400 

177241 

178084 

178929 

179776 

180625 

181476 

182329 

183184 

184041 

430 

184900 

185761 

186624 

187489 

188356 

189225 

190096 


191844 

192721 

440 

193600 

194481 

195364 

196249 

197136 

198025 

198916 

199809 


201601 

450 


203401 



206116 

207025 

207936 



210681 

460 

211600 

[ 212521 

213444 

214369 

215296 

216225 

217156 

218089 

219024 

219961 

470 

220900 

221841 

222784 

223729 

224676 

225625 

226576 

227529 

228484 

229441 

480 


231361 

232324 

233289 

234256 

235225 

236196 

237169 

238144 

239121 

490 


241081 




245025 

246016 

im 


249001 

500 


251001 




255025 

256036 


258064 

259081 

510 


261121 

262144 

263169 

264196 

265225 

266256 

267289 

268324 

269361 

520 

BBSs 

271441 

272484 

273529 

274576 

275625 

276676 

277729 

278784 

279841 

530 


281961 



285156 

286225 

287296 

288369 

289444 

290521 

540 

291600 

292681 

[ 

293764 

294849 

295936 

297025 

298116 

299209 

mm 

301401 


¹ Source: Waugh, Albert E., Laboratory Manual and Problems for Elements of Statistical 
Method (McGraw-Hill Book Company, Inc., 1944). 

Table II.—Squares of Numbers.—(Continued) 


N 

0 

1 

2 

3 

4 

5 

6 

7 

8 

9 

550 

302500 

303601 

304704 

305809 

306916 


309136 

310249 

311364 

312481 

560 


314721 

315844 

316969 


319225 

320356 

321489 

322624 

323761 

570 


326041 

327184 

328329 

329476 


331776 

332929 

334084 

335241 

580 


337561 

338724 

339889 

341056 

342225 

343396 

344569 

345744 

346921 

590 


349281 

350464 

351649 

352836 


355216 

356409 

357604 

358801 

600 


361201 

362404 


364816 

366025 

367236 

368449 

369664 

370881 

610 


373321 

374544 

375769 

376996 

378225 

379456 

380689 

381924 

383161 

620 


385641 

386884 

388129 

389376 

390625 

391876 

393129 

394384 

395641 

630 


398161 

399424 

400689 

401956 

403225 

404496 

405769 

407044 

408321 

640 


410881 

412164 

413449 

414736 

416025 

417316 

418609 

419904 

421201 

650 

422500 

423801 

425104 

426409 

427716 

429025 

430336 

431649 

432964 

434281 

660 

435600 

436921 

438244 

439569 

440896 

442225 

443556 

444889 

446224 

447561 

670 

448900 

450241 

451584 

452929 

454276 

455625 

456976 

458329 

459684 

461041 

680 

462400 

463761 

465124 

466489 

467856 

469225 

470596 

471969 

473344 

474721 

690 

Em 

477481 

478864 

480249 

481636 

483025 

484416 

485809 

487204 

488601 

700 


491401 

492804 

494209 

495616 

497025 

498436 

499849 

501264 

502681 

710 


505521 

506944 


509796 

511225 

512656 

514089 

515524 

516961 

720 


519841 

521284 

522729 

524176 

525625 

527076 

528529 

529984 

531441 

730 


534361 

535824 

537289 

538756 

540225 

541696 

543169 

544644 

546121 

740 


549081 

550564 


553536 

555025 

556516 

558009 

559504 

561001 

750 

562500 

564001 

565504 


568516 

570025 

571536 

573049 

574564 

576081 

760 


579121 

580644 

582169 

583696 

585225 

586756 

588289 

589824 

591361 

770 


594441 

595984 

597529 

599076 

600625 

602176 

603729 

605284 

606841 

780 


609961 

611524 


614656 

616225 

617796 

619369 

620944 

622521 

790 


625681 

627264 

628849 

630436 

632025 

633616 

635209 

636804 

638401 

800 


641601 

643204 

644809 

646416 

648025 

649636 

651249 

652864 

654481 

810 


657721 

659344 

660969 

662596 

664225 

665856 

667489 

669124 

670761 

820 

^^B 

674041 

675684 

677329 

678976 

680625 

682276 

683929 

685584 

687241 

830 


690561 

692224 

693889 

695556 

697225 

698896 

mm 

702244 

703921 

840 


707281 

708964 


712336 

714025 

715716 

717409 

719104 

720801 

850 


724201 

725904 


729316 

731025 

CD 

734449 

736164 

737881 

860 


741321 

743044 

744769 

746496 

748225 

749956 

751689 

753424 

755161 

870 


758641 

760384 

762129 

763876 

765625 

767376 

769129 

770884 

772641 

880 


776161 

777924 

779689 

781456 

783225 

784996 

786769 

788544 

790321 

890 


793881 

795664 

797449 

799236 

801025 

802816 

804609 

806404 

808201 

900 


811801 

813604 


817216 

819025 

820836 

822649 

824464 

826281 

910 


829921 

831744 

833569 

835396 

837225 

839056 

mm 

842724 

844561 

920 


848241 

850084 

851929 

853776 

855625 

857476 

859329 

861184 

863041 

930 


866761 

868624 

870489 

872356 

874225 

876096 

877969 

879844 

881721 

940 


885481 

887364 

889249 

891136 

893025 

894916 

896809 

898704 

900601 

950 


904401 

906304 

908209 

910116 

912025 

913936 

915849 

917764 

919681 

960 


923521 

925444 

927369 

929296 

931225 

933156 

935089 

937024 

938961 

970 


942841 

944784 

946729 

948676 

950625 

952576 

954529 

956484 

958441 

980 


962361 

964324 

966289 

968256 

970225 

972196 

974169 

976144 

978121 

990 

H 

982081 

984064 

986049 

988036 

990025 

992016 

994009 

996004 

998001 

Table III.—Square Roots of Numbers from 10 to 100¹ 


N 

0.0 

0.1 

0.2 

0.3 

0.4 

0.5 

0.6 

0.7 

0.8 

0.9 

10 


3.178 

3.194 

3.209 

3.225 

3.240 

3.256 

3.271 

3.286 

3.302 

11 


3.332 

3.347 

3.362 

3.376 

3.391 

3.406 

3.421 

3.435 

3.450 

12 


3.479 

3.493 

3.507 

3.521 

3.536 

3.550 

3.564 

3.578 

3.592 

13 

3.606 

3.619 

3.633 

3.647 

3.661 

3.674 

3.688 

3.701 

3.715 

3.728 

14 

3.742 

3.755 

3.768 

3.782 

3.795 

3.808 

3.821 

3.834 

3.847 

3.860 

15 

3.873 

3.886 

3.899 

3.912 

3.924 

3.937 

3.950 

3.962 

3.975 

3.987 

16 

4.000 

4.012 

4.025 

4.037 

4.050 

4.062 

4.074 

4.087 

4.099 

4.111 

17 

4.123 

4.135 

4.147 

4.159 

4.171 

4.183 

4.195 

4.207 

4.219 

4.231 

18 

4.243 

4.254 

4.266 

4.278 

4.290 

4.301 

4.313 

4.324 

4.336 

4.347 

19 

4.359 

4.370 

4.382 

4.393 

4.405 

4.416 

4.427 

4.438 

4.450 

4.461 

20 

4.472 

4.483 

4.494 

4.506 

4.517 

4.528 

4.539 

4.550 

4.561 

4.572 

21 

4.583 

4.593 

4.604 

4.615 

4.626 

4.637 

4.648 

4.658 

4.669 

4.680 

22 

4.690 

4.701 

4.712 

4.722 

4.733 

4.743 

4.754 

4.764 

4.775 

4.785 

23- 

4.796 

4.806 

4.817 

4.827 

4.837 

4.848 

4.858 

4.868 

4.879 

4.889 

24 

4.899 

4.909 

4.919 

4.930 

4.940 

4.950 

4.960 

4.970 

4.980 

4.990 

25 

5.000 

5.010 

5.020 

5.030 

5.040 

5.050 

5.060 

5.070 

5.079 

5.089 

26 

5.099 

5.109 

5.119 

5.128 

5.138 

5.148 

5.158 

5.167 

5.177 

5.187 

27 

5.196 

5.206 

5.215 

5.225 

5.234 I 

5.244 

5.254 

5.263 

5.273 

5.282 

28 

5.292 

5.301 

5.310 

5.320 

5.329 

5.339 

5.348 

5.357 

5.367 

5.376 

29 

5.385 

5.394 

5.404 

5.413 

5.422 

5.431 

5.441 

5.450 

5.459 

5.468 - 

30 

5.477 

5.486 

5.495 

5.505 

5.514 

5.523 

5.532 

5.541 

5.550 

5.559 

31 

5.568 

5.577 

5.586 

5.595 

5.604 

5.612 

5.621 

5.630 

5.639 

5.648 

32 

5.657 

5.666 

5.674 

5.683 

5.692 

5.701 

5.710 

5.718 

5.727 

5.736 

33 

5.745 

5.753 

5.762 

5.771 

5.779 

5.788 

5.797 

5.805 

5.814 

5.822 

34 

5.831 

1 5.840 

5.848 

5.857 

5.865 

5.874 

5.882 

5.891 

5.899 

5.908 

35 

5.916 

5.925 

5.933 

5.941 

5.950 

5.958 

5.967 

5.975 

5.983 

5.992 

36 

6.000 

6.008 

6.017 

6.025 

6.033 

6.042 

6.050 

6.058 

6.066 

6.075 

37 

‘ 6.083 

6.091 

6.099 

6.107 

6.116 

6.124 

6.132 

6.140 

6.148 

6.156 

38 

! 6.164 

6.173 

6.181 

6.189 

6.197 

6.205 

6.213 

6.221 

6.229 1 

6.237 

39 

6.245 

6.253 

6.261 

6.269 

6.277 

6.285 

6.293 

6.301 

6.309 

6.317 

40 

6.325 

6.332 

6.340 

6.348 

6.356 

6.364 

6.372 

6.380 

6.387 

6.395 

41 

6.403 

6.411 

6.419 

6.427 

6.434 

6.442 

6.450 

6.458 

6.465 

6.473 

42 

6.481 

6.488 

6.496 

6.504 

6.512 

! 6.519 

6.527 

6.535 

6.542 

6.550 

43 

6.557 

6.565 

6.573 

6.580 

6.588 

6.595 

6.603 

6.611 

6.618 

6.626 

44 

6.633 

6.641 

6.648 

6.656 

6.663 

6.671 

6.678 

6.686 

6.693 

6.701 

45 

6.708 

6.716 

6.723 

6.731 

1 

6.738 

6.745 

6.753 

6.760 

6.768 

6.775 

46- 

6.782 

6.790 

6.797 

6.804 

6.812 

6.819 

6.826 

6.834 

6.841 

6.848 

47 

6.856 

6.863 

6.870 

6.878 

6.885 

6.892 

6.899 

6.907 

6.914 

6.921 

48 

6.928 

6.935 

6.943 

6.950 

6.957 

6.964 

6.971 

6.979 

6.986 

6.993 

49 

7.000 

7.007 

7.014 

7.021 

7.029 

7.036 

7.043 

7.050 

7.057 

7.064 

y 

50 

7.071 

7.078 

7.085 

7.092 

7.099 

7.106 

7.113 

7.120 

7.127 

7.134 

51 

7.141 

7.148 

7.155 

7.162 

7.169 

7.176 

7.183 

7.190 

7.197 

7.204 

52 

7.211 

7.218 

7.225 

7.232 

7.239 

7.246 

7.253 

7.259 

7.266 

7.273 

53 

7.280 

7.287 

7.294 

7.301 

7.308 

7.314 

7.321 

7.328 

7.335 

7.342 

54 

7.348 

7.355 

7.362 

7.369 

7.376 

7.382 

7.389 

7.396 

7.403 

7.409 


¹ Source: Waugh, Albert E., Laboratory Manual and Problems for Elements of Statistical 
Method (McGraw-Hill Book Company, Inc., 1944). 

Table III.—Square Roots of Numbers from 10 to 100.—(Continued) 


N 

0.0 

0.1 

0.2 

0.3 

0.4 

0.5 

0.6 

0.7 

0.8 

0.9 

55 

7.416 

7.423 

7.430 

7.436 

7.443 

7.450 

7.457 

7.463 

7.470 

7.477 

56 

7.483 

7.490 

7.497 

7.503 

7.510 

7.517 

7.523 

7.530 

7.537 

7.543 

57 

7.550 

7.556 

7.563 

7.570 

7.576 

7.582 

7.589 

7.596 

7.603 

7.609 

58 

7.616 

7.622 

7.629 

7.635 

7.642 

7.649 

7.655 

7.662 

7.668 

7.675 

59 

7.681 

7.688 

7.694 

7.701 

7.707 

7.714 

7.720 

7.727 

7.733 

7.740 

60 

7.746 

7.752 

7.759 

7.765 

7.772 

7.778 

7.785 

7.791 

7.797 

7.804 

61 

7.810 

7.817 

7.823 

7.829 

7.836 

7.842 

7.849 

7.855 

7.861 

7.868 

62 

7.874 

7.880 

7.887 

7.893 

7.899 

7.906 

7.912 

7.918 

7.925 

7.931 

63 

7.937 

7.944 

7.950 

7.956 

7.962 

7.969 

7.975 

7.981 

7.987 

7.994 

64 

8.000 

8.006 

8.012 

8.019 

8.025 

8.031 

8.037 

8.044 

8.050 

8.056 

65 

8.062 

8.068 

8.075 

8.081 

8.087 

8.093 

8.099 

8.106 

8.112 

8.118 

66 

8.124 

8.130 

8.136 

8.142 

8.149 

8.155 

8.161 

8.167 

8.173 

8.179 

67 

8.185 

8.191 

8.198 

8.204 

8.210 

8.216 

8.222 

8.228 

8.234 

8.240 

68 

8.246 

8.252 

8.258 

8.264 

8.270 

8.276 

8.283 

8.289 

8.295 

8.301 

69 

8.307 

8.313 

8.319 

8.325 

8.331 

8.337 

8.343 

8.349 

8.355 

8.361 

70 

8.367 

8.373 

8.379 

8.385 

8.390 

8.396 

8.402 

8.408 

8.414 

8.420 

71 

8.426 

8.432 

8.438 

8.444 

8.450 

8.456 

8.462 

8.468 

8.473 

8.479 

72 

8.485 

8.491 

8.497 

8.503 

8.509 

8.515 

8.521 

8.526 

8.532 

8.538 

73 

8.544 

8.550 

8.556 

8.562 

8.567 

8.573 

8.579 

8.585 

8.591 

8.597 

74 

8.602 

8.608 

8.614 

8.620 

8.626 

8.631 

8.637 

8.643 

8.649 

8.654 

1 

75 

8.660 

8.666 

8.672 

8.678 

8.683 

8.689 

8.695 

8.701 

8.706 

8.712 

76 

8.718 

8.724 

8.730 

8.735 

8.741 

8.746 

8.752 1 

8.758 

8.764 

8.769 

77 

8.775 

8.781 

8.786 

8.792 

8.798 

8.803 

8.809 1 

8.815 

8.820 

8.826 

78 

8.832 

8.837 

8.843 

8.849 

8.854 

8.860 

8.866 

8.871 

8.877 

8.883 

79 

8.888 

8.894 

8.899 

8.905 

8.911 

8.916 

8.922 

8.927 

8.933 

8.939 

80 

8.944 

8.950 

8.955 

8.961 

8.967 

8.972 

8.978 

8.983 

8.989 

8.994 

81 

9.000 

9.006 

9.011 

9.017 

9.022 

9.028 

9.033 

9.039 

9.044 

9.050 

82 

9.055 

9.061 

9.066 

9.072 

9.077 

9.083 

9.088 

9.094 

9.099 

9.105 

83 

9.110 

9.116 

9.121 

9.127 

9.132 

9.138 

9.143 

9.149 

9.154 

9.160 

84 

9.165 

9.171 

9.176 

9.182 

9.187 

9.192 

9.198 

9.203 

9.209 

9.214 

85 

9.220 

9.225 

9.230 

9.236 

9.241 

9.247 

9.252 

9.257 

9.263 

9.268 

86 

9.274 

9.279 

9.284 

9.290 

9.295 

9.301 

9.306 

9.311 

9.317 

9.322 

87 

9.327 

9.333 

9.338 

9.343 

9.349 

9.354 

9.359 

9.365 

9.370 

9.376 

88 

9.381 

9.386 

9.391 

9.397 

9.402 

9.407 

9.413 

9.418 

9.423 

9.429 

89 

9.434 

9.439 

9.445 

9.450 

9.455 

9.460 

9.466 

9.471 

9.476 

9.482 

90 

9.487 

9.492 

9.497 

9.503 

9.508 

9.513 

9.518 

9.524 

9.529 

9.534 

91 

9.539 

9.545 

9.550 

9.555 

9.560 

9.566 

9.571 

9.576 

9.581 

9.586 

92 

9.592 

9.597 

9.602 

9.607 

9.612 

9.618 

9.623 

9.628 

9.633 

9.638 

93 

9.644 

9.649 

9.654 

9.659 

9.664 

9.670 

9.675 

9.680 

9.685 

9.690 

94 

9.695 

9.701 

9.706 

9.711 

9.716 

9.721 

9.726 

9.731 

9.737 

9.742 

95 

9.747 

9.752 

9.757 

9.762 

9.767 

9.772 

9.778 

9.783 

9.788 

9.793 

96 

9.798 

9.803 

9.808 

9.813 

9.818 

9.823 

9.829 

9.834 

9.839 

9.844 

97 

9.849 

9.854 

9.859 

9.864 

9.869 

9.874 

9.879 

9.884 

9.889 

9.894 

98 

9.899 

9.905 

9.910 

9.915 

9.920 

9.925 

9.930 

9.935 

9.940 

9.945 


9.950 

9.955 

9.960 

9.965 

9.970 

9.975 

9.980 

9.985 

9.990 

9.995 

Table IV.—Square Roots of Numbers from 100 to 1000¹ 


N 

0 

■ 

2 

3 

4 

5 


H 

8' 

9 

100 

10.00 

10.05 

10.10 

10.15 

10.20 

10.25 

10.30 

10.34 

10.39 

10.44 

110 

10.49 

10.54 

10.58 

10.63 

10.68 

10.72 

10.77 

10.82 

10.86 

10.91 

120 

10.95 

11.00 

11.05 

11.09 

11.14 

11.18 

11.22 

11.27 

11.31 

11.36 

130 

11.40 

11.45 

11.49 

11.53 

11.58 

11.62 

11.66 

11.70 

11.75 

11.79 

140 

11.83 

11.87 

11.92 

11.96 

12.00 

12.p4 

12.08 

12.12 

12.17 

12.21 

150 

12.25 

12.29 

12.33 

12.37 

12.41 

12.45 

12.49 

12.53 

12.57 

12.61 

160 

12.65 

12.69 

12.73 

12.77 

12.81 

12.85 

12.88 

12.92 

12.96 

13.00 

170 

13.04 

13.08 

13.11 

13.15 

13.19 

13.23 

13.27 

13.30 

13.34 

13.38 

180 

13.42 

13.45 

13.49 

13.53 

13.56 

13.60 

13.64 

13.67 

13.71 

13.75 

190 

13.78 

13.82 

13.86 

13.89 

13.93 

13.96 

14.00 

14.04 

14.07 

14.11 

200 

14.14 

14.18 

14.21 

14.25 

14.28 

14.32 

14.35 

14.39 

14.42 

14.46 

210 

14.49 

14.53 

14.56 

14.59 

14.63 

14.66 

14.70 

14.73 

14.76 

14.80, 

220 

14.83 

14.87 

14.90 

14.93 

14.97 

15.00 

15.03 

15.07 

15.10 

15.13 

230 

15.17 

15.20 

15.23 

15.26 

15.30 

15.83 

15.36 

15.39 

15.43 

15.46 

240 

15.49 

15.52 

15.56 

15.59 

15.62 

15.65 

15.68 

16.72 

15.75 

15.78 

250 

15.81 

15.84 

15.87 

15.91 

15.94 

15.97 

16.00 

16.03 

16.06 

16.09 

260 

16.12 

16.16 

16.19 

16.22 

16.25 

16.28 

16.31 

16.34 

16.37 

16.40 

270 

16.43 

16.46 

16.49 

16.52 

16.65 

16.68 

16.61 

16.64 

16.67 

16.70 

280 

16.73 

16.76 

16.79 

16.82 

16.85 

16.88 

16.91 

16.94 

16.97 

17.00 

290 

17.03 

17.06 

17.09 

17.12 

17.15 

17.18 

17.20 

17.23 

17.26 

17.29 

300 

17.32 

17.35 

17.38 

17.41 

17.44 

17.46 

17.49 

17.62 

17.65 

17.58 

310 

17.61 

17.64 

17.66 

17.69 

17.72 

17.75 

17.78 

17.80 

17.83 

17.86 

320 

17.89 

17.92 

17.94 

17.97 

18.00 

18.03 

18.06 

18.08 

18.11 

18.14 

330 

18.17 

18.19 

18.22 

18.25 

18.28 

18.30 

18.33 

18.36 

18.38 

18.41 

340 

18.44 

18.47 

18.49 

*18.52 

18.55 

18.57 

18.60 

18.63 

18.65 

18.68 

350 

18.71 

18.74 

18.76 

18.79 

18.81 

18.84 

18.87 

18.89 

18.92 

18.95 

360 

18.97 

19.00 

19.03 

19.05 

19.08 

19.10 

19.13 

19.16 

19.18 

19.21 

370 

19.24 

19.26 

19.29 

19.31 

19.34 

19.36 

19.39 

19.42 

19.44 

19.47 

380 

19.49 

19.52 

19.54 

19.57 

19.60 

19.62 

19.65 

19.67 

19.70 

19.72 

390 

19.75 

19.77 

19.80 

19.82 

19.85 

19.87 

19.90 

19.92 

19.95 

19.98 

400 

20.00 

20.02 

20.05 

20.07 

20.10 

20.12 

20.15 

20.17 

20.20 

20.22 

410 

20.25 

20.27 

20.30 

20.32 

20.35 

20.37 

20.40 

20,42 

20.44 

20.47 

420 

20.49 

20.52 

20.54 

20.57 

20.59 

20.62 

20.64 

20.66 

20.69 

20.71 

430 

20.74 

20.76 

20.78 

20.81 

20.83 

20.86 

20.88 

20.90 

20.93 

20.95 

440 

20.98 

21.00 

21.02 

21.05 

21.07 

21.10 

21.12 

21.14 

21.17* 

21.19 

450 

21.21 

21.24 

21.26 

21.28 

21.31 

21.33 

21.35 

21.38 

21.40 

21.42 

460 

21.45 

21.47 

21.49 

21.52 

21.54 

21,56 

21.69 

21.61 

21.63 

21.66 

470 

21,68 

21.70 

21.73 

21.75 

21.77 

21.79 

21.82 

21.84 

21.86 

21.89 

480 

21.91 

21.93 

21.95 

21.98 

22.00 

22.02 

22.05 

22.07 

22.09 

22.11 

490 

22.14 

22.16 

22.18 

22.20 

22.23 

22.26 

22.27 

22.29 

22.32 

22.34 

500 

510 

520 

530 

540 

550 

22.36 

22.58 

22.80 

23.02 

23.24 

23.45 

22.38 

22.61 

22.83 

23.04 

23.26 

23.47 

22.41 

22.63 

22.85 

23.07 

23.28 

23.49 

22.43 

22.65 

22.87 

23.09 

23.30 

23.52 

22.45 

22.67 

22.89 

23.11 

23.32 

23.54 

22.47 

22.69 

22.91 

23.13 

23.36 

23.66 

22.49 

22.72 

22.93 

23.15' 

23.37 

23.58 

22.52 

22.74 

22.96 

23.17 

23.39 

23.60 

22.54 

22.76 

22.98 

23.19 

^3.41 

23.62 

22.56 

22.78 

23.00 

23.22 

23.^ 

23.64 


1 Souroev WAtraa, Amitot E.. Lahoralary Manual and Problems for Elemenls of StatistKol 
Method (McGraw-HUl Book Company, Inc., 1944). 
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Table IV. —Sqttabb Roots of Nxtmbebs fbom 100 to 1000. — (Continwd) 


N 

0 

n 


3 

n 


6 

n 

8 

9 

550 

23.45 

23.47 

23.49 

23.52 

23.54 

23.56 

23.58 

23.60 

23.62 

23.64 

560 

23.66 

23.69 

23.71 

23.73 

23.75 

23.77 

23.79 

23.81 

23.83 

23.85 

570 

23.87 

23.90 

23.92 

23.94 

23.96 

23.98 

24.00 

24.02 

24.04 

24.06 

580 

24.08 

24.10 

24.12 

24.15 

24.17 

24.19 

24.21 

24.23 

24.25 

24.27 

590 

24.29 

24.31 

24.33 

24.35 

24.37 

24.39 

24.41 

24.43 

24.45 

24.47 

600 

24.49 

24.52 

24.54 

24.56 

24.58 

24.60 

24.62 

24.64 

24.66 

24.68 

610 

24.70 

24.72 

24.74 

24.76 

24.78 

24.80 

24.82 

24.84 

24.86 

24.88 

620 

24.90 

24.92 

24.94 

24.96 

24.98 

25.00 

25.02 



25.08 

630 

25.10 

25.12 

25.14 

25.16 

25.18 

25.20 

25.22 



25.28 

640 

25.30 

25.32 

25.34 

25.36 

25.38 

25.40 

25.42 

25.44 

25.46 

25.48 

650 

25.50 

25.51 

25.53 

25.55 

25.57 

25.59 

25.61 

25.63 

25.65 

25.67 

660 

25.69 

25.71 

25.73 

25.75 

25.77 

25.79 

25.81 

25.83 

25.85 

25.86 

670 

25.88 

25.90 

25.92 

25.94 

25.96 

25.98 

26.00 

26.02 

26.04 

26.06 

680 

26.08 

26.10 

26.12 

26.13 

26.15 

26.17 

26.19 

26.21 

26.23 

26.25 

690 

26.27 

26.29 

26.31 

26.32 

26.34 

26.36 

26.38 

26.40 

26.42 

26.44 

700 

26.46 

26.48 

26.50 

26.51 

26.53 

26.55 

26.57 

26.59 

26.61 

26.63 

710 

26.65 

26.66 

26.68 

26.70 

26.72 

26.74 

26.76 

26.78 

26.80 

26.81 

720 

26.83 

26.85 

26.87 

26.89 

26.91 

26.93 

26.94 

26.96 

26.98 

27.00 

730 

27.02 

27,04 

27.06 

27.07 

27.09 

27.11 

27.13 

27.15 

27.17 

27.18 

740 

27.20 

27.22 

27.24 

27.26 

27.28 

27.29 

27.31 

27.33 

27.35 

27.37 

750 

27.39 

27.40 

27.42 

27,44 

27.46 

27.48 

27.50 

27.51 

27.53 

27.55 

760 

27.57 

27.59 

27.60 

27.62 

27.64 

27.66 

27.68 

27.69 

27.71 

27.73 

770 

27.75 

27.77 

27.78 

27.80 

27.82 

27.84 

27.86 

27.87 

27.89 

27.91 

780 

27.93 

27.95 

27.96 

27.98 

28.00 

28.02 

38.04 

28.05 

28.07 

28.09 

790 

28.11 

28.12 

28.14 

28.16 

28.18 

28.20 

28.21 

28.23 

28.25 

28.27 


28.28 

28.30 

28.32 

28.34 

28.35 

28.37 

28.39 

28.41 

28.43 

28.44 

810 

28.46 

28.48 

28.50 

28.51 

28.53 

28.55 

28.57 

28.58 

28.60 

28.62 

820 

28.64 

28.65 

28,67 

28.69 

28.71 

28.72 

28.74 

28.76 

28.78 

28.79 

830 

28.81 

28.83 

28.84 

28.86 

28.88 

28.90 

28.91 

28.93 

28.95 

28.97 

840 

28.98 

29.00 

29.02 

29.03 

29.05 

29.07 

29.09 

29.10 

29.12 

29.14 

850 

29.15 

29.17 

29.19 

29.21 

29.22 

29.24 

29.26 

29.27 

29.29 

29.31 

860 

29.33 

29.34 

29.36 

29.38 

29.39 

29.41 

29.43 

29.44 

29.46 

29.48 

870 

29.50 

29.51 

29.53 

29.55 

29.56 

29.58 

29.60 

29.61 

29.63 

29.65 


29.66 

1 29.68 

29.70 

29-72 

29.73 

29.75 

29.77 

29.78 

29.80 

29.82 


; 29.83 

29.85 

29.87 

29.88 

29.90 

29.92 

29.93 

4 

30.10 

29.95 

29.97 

29.98 

900 

30.00 

30.02 

130.03 

30.05 

30.07 

30.08 

30.12 

30.13 

30.15 

910 

30.17 

30.18 

30.20 

30.22 

30.23 

30.25 

30.27 

30.28 

30.30 

30.32 

920 

30.33 

30.35 

30.36 

30.38 

30.40 

30.41 

30.43 

30.45 

30.46 

30.48 


30.50 

30.51 

30.53 

30.54 

30.56 

30.58 

30.59 

30.61 

30.63’ 

30.64 

940 

30.66 

30.68 

30.69 

30.71 

30.72 

30.74 

30.76 

30.77 

30.79 

30.81 

959 

30.82 

30.84 

30.85 

30.87 

30.89 

30.90 

30.92 

30.94 

30.95 

30.97 

960 

30.98^ 

31.00 

31.02 

31.03 

31.05 

31.06 

31.08 

31.10 

31.11 

31.13 

970 

31.14 

31.16 

31.18 

31.19 

31.21 

31.22 

31.24 

31.26 

31.27 

31.29 

980 

31.30 

31.32 

31.34 

31.35 

31.37 

31.38 

31.40 

31.42 

31.43 

31.45 


31.46 

31.48 

31.50 



31.54 

31.56 

31.58 

31.59 

31.61 








































Table V.—Reciprocals of Numbers¹

   N     .00     .01     .02     .03     .04     .05     .06     .07     .08     .09
1.00  1.0000   .9901   .9804   .9709   .9615   .9524   .9434   .9346   .9259   .9174
1.10   .9091   .9009   .8929   .8850   .8772   .8696   .8621   .8547   .8475   .8403
1.20   .8333   .8264   .8197   .8130   .8065   .8000   .7937   .7874   .7812   .7752
1.30   .7692   .7634   .7576   .7519   .7463   .7407   .7353   .7299   .7246   .7194
1.40   .7143   .7092   .7042   .6993   .6944   .6897   .6849   .6803   .6757   .6711
1.50   .6667   .6623   .6579   .6536   .6494   .6452   .6410   .6369   .6329   .6289
1.60   .6250   .6211   .6173   .6135   .6098   .6061   .6024   .5988   .5952   .5917
1.70   .5882   .5848   .5814   .5780   .5747   .5714   .5682   .5650   .5618   .5587
1.80   .5556   .5525   .5495   .5464   .5435   .5405   .5376   .5348   .5319   .5291
1.90   .5263   .5236   .5208   .5181   .5155   .5128   .5102   .5076   .5051   .5025
2.00   .5000   .4975   .4950   .4926   .4902   .4878   .4854   .4831   .4808   .4785
2.10   .4762   .4739   .4717   .4694   .4673   .4651   .4630   .4608   .4587   .4566
2.20   .4545   .4525   .4504   .4484   .4464   .4444   .4425   .4405   .4386   .4367
2.30   .4348   .4329   .4310   .4292   .4274   .4255   .4237   .4219   .4202   .4184
2.40   .4167   .4149   .4132   .4115   .4098   .4082   .4065   .4049   .4032   .4016
2.50   .4000   .3984   .3968   .3953   .3937   .3922   .3906   .3891   .3876   .3861
2.60   .3846   .3831   .3817   .3802   .3788   .3774   .3759   .3745   .3731   .3717
2.70   .3704   .3690   .3676   .3663   .3650   .3636   .3623   .3610   .3597   .3584
2.80   .3571   .3559   .3546   .3534   .3521   .3509   .3496   .3484   .3472   .3460
2.90   .3448   .3436   .3425   .3413   .3401   .3390   .3378   .3367   .3356   .3344
3.00   .3333   .3322   .3311   .3300   .3289   .3279   .3268   .3257   .3247   .3236
3.10   .3226   .3215   .3205   .3195   .3185   .3175   .3165   .3155   .3145   .3135
3.20   .3125   .3115   .3106   .3096   .3086   .3077   .3067   .3058   .3049   .3040
3.30   .3030   .3021   .3012   .3003   .2994   .2985   .2976   .2967   .2959   .2950
3.40   .2941   .2933   .2924   .2915   .2907   .2899   .2890   .2882   .2874   .2865
3.50   .2857   .2849   .2841   .2833   .2825   .2817   .2809   .2801   .2793   .2786
3.60   .2778   .2770   .2762   .2755   .2747   .2740   .2732   .2725   .2717   .2710
3.70   .2703   .2695   .2688   .2681   .2674   .2667   .2660   .2653   .2646   .2639
3.80   .2632   .2625   .2618   .2611   .2604   .2597   .2591   .2584   .2577   .2571
3.90   .2564   .2558   .2551   .2545   .2538   .2532   .2525   .2519   .2513   .2506
4.00   .2500   .2494   .2488   .2481   .2475   .2469   .2463   .2457   .2451   .2445
4.10   .2439   .2433   .2427   .2421   .2415   .2410   .2404   .2398   .2392   .2387
4.20   .2381   .2375   .2370   .2364   .2358   .2353   .2347   .2342   .2336   .2331
4.30   .2326   .2320   .2315   .2309   .2304   .2299   .2294   .2288   .2283   .2278
4.40   .2273   .2268   .2262   .2257   .2252   .2247   .2242   .2237   .2232   .2227
4.50   .2222   .2217   .2212   .2208   .2203   .2198   .2193   .2188   .2183   .2179
4.60   .2174   .2169   .2164   .2160   .2155   .2151   .2146   .2141   .2137   .2132
4.70   .2128   .2123   .2119   .2114   .2110   .2105   .2101   .2096   .2092   .2088
4.80   .2083   .2079   .2075   .2070   .2066   .2062   .2058   .2053   .2049   .2045
4.90   .2041   .2037   .2033   .2028   .2024   .2020   .2016   .2012   .2008   .2004
5.00   .2000   .1996   .1992   .1988   .1984   .1980   .1976   .1972   .1968   .1965
5.10   .1961   .1957   .1953   .1949   .1946   .1942   .1938   .1934   .1930   .1927
5.20   .1923   .1919   .1916   .1912   .1908   .1905   .1901   .1898   .1894   .1890
5.30   .1887   .1883   .1880   .1876   .1873   .1869   .1866   .1862   .1859   .1855
5.40   .1852   .1848   .1845   .1842   .1838   .1835   .1832   .1828   .1825   .1821
5.50   .1818   .1815   .1812   .1808   .1805   .1802   .1799   .1795   .1792   .1789
5.60   .1786   .1783   .1779   .1776   .1773   .1770   .1767   .1764   .1761   .1757
5.70   .1754   .1751   .1748   .1745   .1742   .1739   .1736   .1733   .1730   .1727
5.80   .1724   .1721   .1718   .1715   .1712   .1709   .1706   .1704   .1701   .1698
5.90   .1695   .1692   .1689   .1686   .1684   .1681   .1678   .1675   .1672   .1669
6.00   .1667   .1664   .1661   .1658   .1656   .1653   .1650   .1647   .1645   .1642
6.10   .1639   .1637   .1634   .1631   .1629   .1626   .1623   .1621   .1618   .1616
6.20   .1613   .1610   .1608   .1605   .1603   .1600   .1597   .1595   .1592   .1590
6.30   .1587   .1585   .1582   .1580   .1577   .1575   .1572   .1570   .1567   .1565
6.40   .1562   .1560   .1558   .1555   .1553   .1550   .1548   .1546   .1543   .1541
6.50   .1538   .1536   .1534   .1531   .1529   .1527   .1524   .1522   .1520   .1517
6.60   .1515   .1513   .1511   .1508   .1506   .1504   .1502   .1499   .1497   .1495
6.70   .1493   .1490   .1488   .1486   .1484   .1481   .1479   .1477   .1475   .1473
6.80   .1471   .1468   .1466   .1464   .1462   .1460   .1458   .1456   .1453   .1451
6.90   .1449   .1447   .1445   .1443   .1441   .1439   .1437   .1435   .1433   .1431
7.00   .1429   .1427   .1424   .1422   .1420   .1418   .1416   .1414   .1412   .1410
7.10   .1408   .1406   .1404   .1403   .1401   .1399   .1397   .1395   .1393   .1391
7.20   .1389   .1387   .1385   .1383   .1381   .1379   .1377   .1376   .1374   .1372
7.30   .1370   .1368   .1366   .1364   .1362   .1361   .1359   .1357   .1355   .1353
7.40   .1351   .1350   .1348   .1346   .1344   .1342   .1340   .1339   .1337   .1335
7.50   .1333   .1332   .1330   .1328   .1326   .1324   .1323   .1321   .1319   .1318
7.60   .1316   .1314   .1312   .1311   .1309   .1307   .1305   .1304   .1302   .1300
7.70   .1299   .1297   .1295   .1294   .1292   .1290   .1289   .1287   .1285   .1284
7.80   .1282   .1280   .1279   .1277   .1276   .1274   .1272   .1271   .1269   .1267
7.90   .1266   .1264   .1263   .1261   .1259   .1258   .1256   .1255   .1253   .1252
8.00   .1250   .1248   .1247   .1245   .1244   .1242   .1241   .1239   .1238   .1236
8.10   .1235   .1233   .1232   .1230   .1228   .1227   .1225   .1224   .1222   .1221
8.20   .1220   .1218   .1217   .1215   .1214   .1212   .1211   .1209   .1208   .1206
8.30   .1205   .1203   .1202   .1200   .1199   .1198   .1196   .1195   .1193   .1192
8.40   .1190   .1189   .1188   .1186   .1185   .1183   .1182   .1181   .1179   .1178
8.50   .1176   .1175   .1174   .1172   .1171   .1170   .1168   .1167   .1166   .1164
8.60   .1163   .1161   .1160   .1159   .1157   .1156   .1155   .1153   .1152   .1151
8.70   .1149   .1148   .1147   .1145   .1144   .1143   .1142   .1140   .1139   .1138
8.80   .1136   .1135   .1134   .1132   .1131   .1130   .1129   .1127   .1126   .1125
8.90   .1124   .1122   .1121   .1120   .1119   .1117   .1116   .1115   .1114   .1112
9.00   .1111   .1110   .1109   .1107   .1106   .1105   .1104   .1103   .1101   .1100
9.10   .1099   .1098   .1096   .1095   .1094   .1093   .1092   .1091   .1089   .1088
9.20   .1087   .1086   .1085   .1083   .1082   .1081   .1080   .1079   .1078   .1076
9.30   .1075   .1074   .1073   .1072   .1071   .1070   .1068   .1067   .1066   .1065
9.40   .1064   .1063   .1062   .1060   .1059   .1058   .1057   .1056   .1055   .1054
9.50   .1053   .1052   .1050   .1049   .1048   .1047   .1046   .1045   .1044   .1043
9.60   .1042   .1041   .1040   .1038   .1037   .1036   .1035   .1034   .1033   .1032
9.70   .1031   .1030   .1029   .1028   .1027   .1026   .1025   .1024   .1022   .1021
9.80   .1020   .1019   .1018   .1017   .1016   .1015   .1014   .1013   .1012   .1011
9.90   .1010   .1009   .1008   .1007   .1006   .1005   .1004   .1003   .1002   .1001

¹ Source: Waugh, Albert E., Laboratory Manual and Problems for Elements of Statistical
Method (McGraw-Hill Book Company, Inc., 1944).

Table VI.—Areas under the Normal Curve
Fractional parts of the total area (1.000) under the normal curve between
the mean and a perpendicular erected at various numbers of standard
deviations (x/σ) from the mean.¹ To illustrate the use of the table, 39.065
per cent of the total area under the curve will lie between the mean and a
perpendicular erected at a distance of 1.23σ from the mean.

Each figure in the body of the table is preceded by a decimal point. 
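
The illustration in the heading of Table VI (a perpendicular erected at 1.23 standard deviations) can be checked numerically. The sketch below is a modern cross-check, not part of the original text; it assumes the tabled area equals one-half of the error function evaluated at z/√2, and the function name `area_mean_to_z` is hypothetical.

```python
import math

def area_mean_to_z(z: float) -> float:
    """Fraction of the total area under the normal curve lying between
    the mean and a perpendicular erected z standard deviations away.
    Assumption: area = 0.5 * erf(|z| / sqrt(2))."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

# Reproduce the worked illustration from the table heading (z = 1.23).
print(f"{area_mean_to_z(1.23):.5f}")

# The curve is symmetrical, so a negative deviation gives the same area.
print(f"{area_mean_to_z(-1.23):.5f}")
```

Because each tabled figure is preceded by a decimal point, an entry such as 39065 in the table corresponds to the value 0.39065 returned here.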


x/σ    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0  00000  00399  00798  01197  01595  01994  02392  02790  03188  03586
0.1  03983  04380  04776  05172  05567  05962  06356  06749  07142  07535
0.2  07926  08317  08706  09095  09483  09871  10257  10642  11026  11409
0.3  11791  12172  12552  12930  13307  13683  14058  14431  14803  15173
0.4  15542  15910  16276  16640  17003  17364  17724  18082  18439  18793
0.5  19146  19497  19847  20194  20540  20884  21226  21566  21904  22240
0.6  22575  22907  23237  23565  23891  24215  24537  24857  25175  25490
0.7  25804  26115  26424  26730  27035  27337  27637  27935  28230  28524
0.8  28814  29103  29389  29673  29955  30234  30511  30785  31057  31327
0.9  31594  31859  32121  32381  32639  32894  33147  33398  33646  33891
1.0  34134  34375  34614  34850  35083  35313  35543  35769  35993  36214
1.1  36433  36650  36864  37076  37286  37493  37698  37900  38100  38298
1.2  38493  38686  38877  39065  39251  39435  39617  39796  39973  40147
1.3  40320  40490  40658  40824  40988  41149  41308  41466  41621  41774
1.4  41924  42073  42220  42364  42507  42647  42786  42922  43056  43189
1.5  43319  43448  43574  43699  43822  43943  44062  44179  44295  44408
1.6  44520  44630  44738  44845  44950  45053  45154  45254  45352  45449
1.7  45543  45637  45728  45818  45907  45994  46080  46164  46246  46327
1.8  46407  46485  46562  46638  46712  46784  46856  46926  46995  47062
1.9  47128  47193  47257  47320  47381  47441  47500  47558  47615  47670
2.0  47725  47778  47831  47882  47932  47982  48030  48077  48124  48169
2.1  48214  48257  48300  48341  48382  48422  48461  48500  48537  48574
2.2  48610  48645  48679  48713  48745  48778  48809  48840  48870  48899
2.3  48928  48956  48983  49010  49036  49061  49086  49111  49134  49158
2.4  49180  49202  49224  49245  49266  49286  49305  49324  49343  49361
2.5  49379  49396  49413  49430  49446  49461  49477  49492  49506  49520
2.6  49534  49547  49560  49573  49585  49598  49609  49621  49632  49643
2.7  49653  49664  49674  49683  49693  49702  49711  49720  49728  49736
2.8  49744  49752  49760  49767  49774  49781  49788  49795  49801  49807
2.9  49813  49819  49825  49831  49836  49841  49846  49851  49856  49861
3.0  49865
3.5  4997674
4.0  4999683
4.5  4999966
5.0  4999997133

¹ This table has been adapted, by permission, from F. C. Kent, “Elements of Statistics”
(McGraw-Hill Book Company, Inc., 1924).

Table VII.—Ordinates of the Normal Curve
Ordinates (heights) of the standard normal curve.¹ The height (y) at
any distance (x) from the mean is

y = 0.39894 e^{-x^2/2}

To make the curve fit a histogram in which the abscissa scale is measured
in original x units instead of standard-deviation (x/σ) units, multiply these
ordinates by Ni/σ where N is the number of cases, i the class interval, and
σ the standard deviation.

Each figure in the body of the table is preceded by a decimal point. 
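
The ordinate formula and the Ni/σ rescaling described in the heading of Table VII can be sketched in a few lines. This is an illustrative cross-check rather than the authors' procedure: the function names are hypothetical, and the histogram figures used at the end (200 cases, class interval 5, standard deviation 10, mean 50) are made-up values chosen only to show the arithmetic. The last function evaluates the same ordinate through common logarithms, the form mentioned in the table's footnote.

```python
import math

def ordinate(z: float) -> float:
    """Height of the standard normal curve at z standard deviations
    from the mean: y = exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def fitted_height(x, mean, sigma, n_cases, class_interval):
    """Ordinate multiplied by N*i/sigma so that the curve overlays a
    histogram whose abscissa is in original x units."""
    z = (x - mean) / sigma
    return ordinate(z) * n_cases * class_interval / sigma

def ordinate_from_log(z: float) -> float:
    """Same ordinate via common logarithms:
    log10 y = -0.5*log10(2*pi) - 0.5*log10(e) * z^2."""
    log_y = -0.5 * math.log10(2.0 * math.pi) - 0.5 * math.log10(math.e) * z * z
    return 10.0 ** log_y

# Hypothetical histogram: N = 200, i = 5, sigma = 10, mean = 50.
print(round(ordinate(0.0), 5))                      # maximum ordinate
print(round(fitted_height(50, 50, 10, 200, 5), 3))  # curve height at the mean
print(round(ordinate_from_log(2.0), 5))             # logarithmic form at z = 2
```

With these made-up figures the multiplier Ni/σ equals 100, so the fitted curve at the mean is simply 100 times the maximum ordinate.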


x/σ    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0  39894  39892  39886  39876  39862  39844  39822  39797  39767  39733
0.1  39695  39654  39608  39559  39505  39448  39387  39322  39253  39181
0.2  39104  39024  38940  38853  38762  38667  38568  38466  38361  38251
0.3  38139  38023  37903  37780  37654  37524  37391  37255  37115  36973
0.4  36827  36678  36526  36371  36213  36053  35889  35723  35553  35381
0.5  35207  35029  34849  34667  34482  34294  34105  33912  33718  33521
0.6  33322  33121  32918  32713  32506  32297  32086  31874  31659  31443
0.7  31225  31006  30785  30563  30339  30114  29887  29658  29430  29200
0.8  28969  28737  28504  28269  28034  27798  27562  27324  27086  26848
0.9  26609  26369  26129  25888  25647  25406  25164  24923  24681  24439
1.0  24197  23955  23713  23471  23230  22988  22747  22506  22265  22025
1.1  21785  21546  21307  21069  20831  20594  20357  20121  19886  19652
1.2  19419  19186  18954  18724  18494  18265  18037  17810  17585  17360
1.3  17137  16915  16694  16474  16256  16038  15822  15608  15395  15183
1.4  14973  14764  14556  14350  14146  13943  13742  13542  13344  13147
1.5  12952  12758  12566  12376  12188  12001  11816  11632  11450  11270
1.6  11092  10915  10741  10567  10396  10226  10059  09893  09728  09566
1.7  09405  09246  09089  08933  08780  08628  08478  08329  08183  08038
1.8  07895  07754  07614  07477  07341  07206  07074  06943  06814  06687
1.9  06562  06438  06316  06195  06077  05959  05844  05730  05618  05508
2.0  05399  05292  05186  05082  04980  04879  04780  04682  04586  04491
2.1  04398  04307  04217  04128  04041  03955  03871  03788  03706  03626
2.2  03547  03470  03394  03319  03246  03174  03103  03034  02965  02898
2.3  02833  02768  02705  02643  02582  02522  02463  02406  02349  02294
2.4  02239  02186  02134  02083  02033  01984  01936  01888  01842  01797
2.5  01753  01709  01667  01625  01585  01545  01506  01468  01431  01394
2.6  01358  01323  01289  01256  01223  01191  01160  01130  01100  01071
2.7  01042  01014  00987  00961  00935  00909  00885  00861  00837  00814
2.8  00792  00770  00748  00727  00707  00687  00668  00649  00631  00613
2.9  00595  00578  00562  00545  00530  00514  00499  00485  00470  00457
3.0  00443
3.5  0008727
4.0  0001338
4.5  0000160
5.0  000001487

¹ This table is adapted, by permission, from Kent, “Elements of Statistics.”

Ordinates may also be computed from the equation log y = 9.600910 - 10 - 0.217147x²,
and for ordinates beyond 3σ it would be necessary to use log y = 9.6009100658 - 10 -
0.2171472410x².

Table VIII.—Hyperbolic Tangents¹


z     r = tanh z    z     r = tanh z    z     r = tanh z
0.00  0.00000    0.55  0.50052    1.10  0.80050
0.01   .01000    0.56   .50798    1.11   .80406
0.02   .02000    0.57   .51536    1.12   .80757
0.03   .02999    0.58   .52267    1.13   .81102
0.04   .03998    0.59   .52990    1.14   .81441
0.05  0.04996    0.60  0.53705    1.15  0.81775
0.06   .05993    0.61   .54413    1.16   .82104
0.07   .06989    0.62   .55113    1.17   .82427
0.08   .07983    0.63   .55805    1.18   .82745
0.09   .08976    0.64   .56490    1.19   .83058
0.10  0.09967    0.65  0.57167    1.20  0.83365
0.11   .10956    0.66   .57836    1.21   .83668
0.12   .11943    0.67   .58498    1.22   .83965
0.13   .12927    0.68   .59152    1.23   .84258
0.14   .13909    0.69   .59798    1.24   .84546
0.15  0.14889    0.70  0.60437    1.25  0.84828
0.16   .15865    0.71   .61068    1.26   .85106
0.17   .16838    0.72   .61691    1.27   .85380
0.18   .17808    0.73   .62307    1.28   .85648
0.19   .18775    0.74   .62915    1.29   .85913
0.20  0.19738    0.75  0.63515    1.30  0.86172
0.21   .20697    0.76   .64108    1.31   .86428
0.22   .21652    0.77   .64693    1.32   .86678
0.23   .22603    0.78   .65271    1.33   .86925
0.24   .23550    0.79   .65841    1.34   .87167
0.25  0.24492    0.80  0.66404    1.35  0.87405
0.26   .25430    0.81   .66959    1.36   .87639
0.27   .26362    0.82   .67507    1.37   .87869
0.28   .27291    0.83   .68048    1.38   .88095
0.29   .28213    0.84   .68581    1.39   .88317
0.30  0.29131    0.85  0.69107    1.40  0.88535
0.31   .30044    0.86   .69626    1.41   .88749
0.32   .30951    0.87   .70137    1.42   .88960
0.33   .31852    0.88   .70642    1.43   .89167
0.34   .32748    0.89   .71139    1.44   .89370
0.35  0.33638    0.90  0.71630    1.45  0.89569
0.36   .34521    0.91   .72113    1.46   .89765
0.37   .35399    0.92   .72590    1.47   .89958
0.38   .36271    0.93   .73059    1.48   .90147
0.39   .37136    0.94   .73522    1.49   .90332
0.40  0.37995    0.95  0.73978    1.50  0.90515
0.41   .38847    0.96   .74428    1.51   .90694
0.42   .39693    0.97   .74870    1.52   .90870
0.43   .40532    0.98   .75307    1.53   .91042
0.44   .41364    0.99   .75736    1.54   .91212
0.45  0.42190    1.00  0.76159    1.55  0.91379
0.46   .43008    1.01   .76576    1.56   .91542
0.47   .43820    1.02   .76987    1.57   .91703
0.48   .44624    1.03   .77391    1.58   .91860
0.49   .45422    1.04   .77789    1.59   .92015
0.50  0.46212    1.05  0.78181    1.60  0.92167
0.51   .46995    1.06   .78566    1.61   .92316
0.52   .47770    1.07   .78946    1.62   .92462
0.53   .48538    1.08   .79320    1.63   .92606
0.54   .49299    1.09   .79688    1.64   .92747

z     r = tanh z    z     r = tanh z    z     r = tanh z
1.65  0.92886    2.20  0.97574    2.75  0.99186
1.66   .93022    2.21   .97622    2.76   .99202
1.67   .93155    2.22   .97668    2.77   .99218
1.68   .93286    2.23   .97714    2.78   .99233
1.69   .93415    2.24   .97759    2.79   .99248
1.70  0.93541    2.25  0.97803    2.80  0.99263
1.71   .93665    2.26   .97846    2.81   .99278
1.72   .93786    2.27   .97888    2.82   .99292
1.73   .93906    2.28   .97929    2.83   .99306
1.74   .94023    2.29   .97970    2.84   .99320
1.75  0.94138    2.30  0.98010    2.85  0.99333
1.76   .94250    2.31   .98049    2.86   .99346
1.77   .94361    2.32   .98087    2.87   .99359
1.78   .94470    2.33   .98124    2.88   .99372
1.79   .94576    2.34   .98161    2.89   .99384
1.80  0.94681    2.35  0.98197    2.90  0.99396
1.81   .94783    2.36   .98233    2.91   .99408
1.82   .94884    2.37   .98267    2.92   .99420
1.83   .94983    2.38   .98301    2.93   .99431
1.84   .95080    2.39   .98335    2.94   .99443
1.85  0.95175    2.40  0.98367    2.95  0.99454
1.86   .95268    2.41   .98400    2.96   .99464
1.87   .95359    2.42   .98431    2.97   .99475
1.88   .95449    2.43   .98462    2.98   .99485
1.89   .95537    2.44   .98492    2.99   .99496
1.90  0.95624    2.45  0.98522    3.0   0.99505
1.91   .95709    2.46   .98551    3.1    .99595
1.92   .95792    2.47   .98579    3.2    .99668
1.93   .95873    2.48   .98607    3.3    .99728
1.94   .95953    2.49   .98635    3.4    .99777
1.95  0.96032    2.50  0.98661    3.5   0.99818
1.96   .96109    2.51   .98688    3.6    .99851
1.97   .96185    2.52   .98714    3.7    .99878
1.98   .96259    2.53   .98739    3.8    .99900
1.99   .96331    2.54   .98764    3.9    .99918
2.00  0.96403    2.55  0.98788    4.0   0.99933
2.01   .96473    2.56   .98812    4.1    .99945
2.02   .96541    2.57   .98835    4.2    .99955
2.03   .96609    2.58   .98858    4.3    .99963
2.04   .96675    2.59   .98881    4.4    .99970
2.05  0.96740    2.60  0.98903    4.5   0.99975
2.06   .96803    2.61   .98924    4.6    .99980
2.07   .96865    2.62   .98946    4.7    .99983
2.08   .96926    2.63   .98966    4.8    .99986
2.09   .96986    2.64   .98987    4.9    .99989
2.10  0.97045    2.65  0.99007    5.0   0.99991
2.11   .97103    2.66   .99026
2.12   .97159    2.67   .99045
2.13   .97215    2.68   .99064
2.14   .97269    2.69   .99083
2.15  0.97323    2.70  0.99101
2.16   .97375    2.71   .99118
2.17   .97426    2.72   .99136
2.18   .97477    2.73   .99153
2.19   .97526    2.74   .99170

¹ Source: Hodgman, Charles D., Mathematical Tables from Handbook of Chemistry and
Physics (1941).

SUBJECT INDEX


A 

Accuracy, in calculating statistics, 
230-231 

Agricultural Situation, 80
Agricultural Statistics, 80
U.S. Department of Agriculture, 80
Bureau of Agricultural Economics, 80-81
Agricultural Situation, 80
Agricultural Yearbook, 80
Crops and Markets, 80
American Bankers Association, 51
Analysis of variance, in multiple correlation, 422-429
in nonlinear correlation, correlation index, 395-396
correlation ratio, 373-376 
in simple correlation, 352-353 
Annalist, 534 

Arithmetic charts, 129-131 
Array, 139-140 
Asymmetry (see Skewness)
Attributes, variable, 157
Averages (see Frequency distributions, averages)

Avogadro's law, 57 

B 

Banking statistics, sources of, 79-86
Federal Reserve, Board of Governors, 82

Federal Reserve Bulletin, 82 
Member Bank Call Report, 82 
National Monetary Commission, 
83 

Statesman's Yearbook, 86 


Banking statistics, U.S. Treasury 
Department, 79 
Abstract of Condition of National 
Banks, 79 

Bar charts, 104-105
Bayes, T., 242 
Bernoulli, Daniel, 242 
Bernoulli, Jacques, 242 
Beta coefficient, 192-193 
Beta cross-product term, 425 
Biennial Census of Manufactures, 
537 

Binomial distribution, symmetrical (see Symmetrical binomial distribution)
Bivariate frequency distribution, 325-353
first-order standard deviation, relation to r, 351-353
illustration of, 325-327
table, 326
joint variation illustrated (bivariate scatter diagrams), 339, 343, 345

methods of summarization and 
comparison, 327-353 
Pearsonian coefficient of correla¬ 
tion, 338-349 

analysis of variance, 352-353 
calculation of, 347, 349 
progressions of means, 328-329 
illustrated (graphs), 328-329 
Bivariate frequency surface, 471-486 
bivariate histogram, 469-471 
illustrated (three-dimensional 
diagram), 470 
independent variables, 471 
lines of regression, 486-488 
mathematical representation, 487-488
nonnormal, 491-492
product-moment formula for r, cases for use or nonuse, 491-492
normal, dependent variables, 477-486
derivation of equation, 481-486, 492-496
equation of rotated ellipse, 481-482
horizontal cross section, 481
horizontal view (graph), 477
illustrated (graph), 480
mathematical representation, 482-486
rotation and narrowing with correlation, 483-484
vertical view, 478
normal, independent variables, 472-476
circular form with equal standard deviations, 475-476
illustrated, 472
normal curve from which derived (graph), 473
elliptical form with unequal standard deviations, 476
horizontal section, 476
illustrated (graph), 475
mathematical representation, 473-476

Bivariate histogram, 469-471 
illustrated (three-dimensional diagram), 470

Bivariate scatter diagram, 365 
Bivariate series, 149-154 
Boltzmann, L., 19 
Boscovich, R. G., 242 
Boyle’s law, 19, 652 
Bradstreet’s index, 522 
Bureau of Foreign and Domestic 
Commerce, 56, 76-77, 512, 535 
Bureau of Home Economics, 48 
Bureau of Labor Statistics, 7, 42-50, 
54, 500, 535, 538 
indexes, 517, 525, 527 
Bureau of Mines, 79 
Business barometers (see Indexes)


C 

Cartograms, 112-121
by bars, 118-119 
by colors and shades, 121 
by cross-hatching, 116-117, 121 
by dots or points, 112-115, 117- 
118, 120-121 
Charles’ law, 57 
Charlier check, 209, 354-355
Charts, 100-121 
arithmetic, 129-131 
bar, 104-105 
bivariate, 150-154 
component-bar, 106-107 
cross-hatched zone, 107-109 
of frequency distributions, 143- 
149 

frequency polygon, 143-147 
histogram, 147 
on a ratio scale, 147-149 
pictogram, 102-103 
ratio, 131-137 

logarithms in relation to, 133- 
137 

sectors of circles, 104, 109-112 
split-bar, 110-112 
of time series, 128-138 
time series in relatives, 130 
types of, 101 

Chi square (χ²) curve, 300
Chi square (χ²) test of goodness of
fit, 300-305
the χ² curve, 300

critical values (table), 304

weaknesses of test, 305 
Class interval, 144, 164 
Classical concept of probability, 
242-247 

Coefficient, confidence, 311-312 
moment, 180 

of multiple correlation, 398, 416- 
418 

of partial correlation, 418-422 
of risk, 310-311
Coefficient of correlation, arithmetical view of, 339-347
Charlier check, 354-355 
computation from grouped data, 
357-362 

short method, 357, 359-362 
tabulation of given data (table), 
356 

computation from ungrouped 
data, 354-357 
work sheet (table), 355 
distinguished from correlation 
ratio, 365 
first-order, 422 
order of, 422 
Pearsonian, 339 

relationship to line of regression, 
349-351 

second-order, 422 
third-order, 422 
zero-order, 422 
Combinations, 233-236 
binomial expansion in, 234-236 
defined and illustrated, 233-234 
Combinatorial analysis, problem in, 
270-283 

Commercial and Financial Chronicle, 

70 

Commercial statistics (see Sources
of statistical data, commercial)
Commodity Yearbook, 71 
Component-bar charts, 106-107 
Confidence coefficient, 311-312 
Confidence interval, 313 
Consumers' Incomes in the United 
States, 48 

Correlation, applications of, by 
social scientists, 324-325 
best way of studying, 365 
bivariate frequency table, 357 
coefficient of, Pearsonian, 339 
zero-order, first-order, second- 
order, etc., 422 

multiple (see Multiple correlation) 
nonlinear, 365-396 

(See also Curvilinear regres¬ 
sion) 


Correlation, origin and development 
of measurement of, 322-324 
partial (see Partial correlation) 
progress in discovery of, 321-322 
ratio, 365-376 
simple, 321-364 

Correlation coefficient (see Coeffi¬ 
cient of correlation) 

Correlation index, 394-396
Correlation ratio, calculation of, 
368-373 

explained, 365-368 
Cournot, A., 12 
Covariance, 405 
Coxe, Tench, 72 
Curve, error, 294-295 
frequency, theoretical significance 
of, 162-166 

Gaussian error, 194, 294-295 
growth vs. frequency, 149 
normal, characteristics of, 265 
formula for, 263-267 
method of fitting to sample 
histogram, 299-300 
normal frequency, 232-320 
characteristics of, 265 
formula for, 263-267 

(See also Normal frequency 
curve) 

probability, 254 
of regression, 367 
standard normal, characteristic of, 
266-267 

formula for, 267 

Curve fitting, curvilinear regres¬ 
sions, 376-397 

fitting normal curve, 299-300 
fitting trends to time series, 564- 
616 

Curvilinear regression, calculation 
of, 370-394 

correlation index, 394-396

and analysis of variance, 395-396 
correlation ratio, and analysis of 
variance, 373-376 
calculation of, 368-373 
work sheet (table), 371 
explained, 365-368
Curvilinear regression, estimates
based on regression equations,
388-390

illustrated, by bivariate scatter 
diagram and fitted curve, 377 
relationship in logarithmic form 
(graph), 378 

relationship in reciprocal form 
(graph), 381 
logarithmic, 377-380 
illustrated (graph), 379 
practical estimates based on 
equation derived, 388-390 
standard error of estimate, cal¬ 
culated, 390-394 
transformation of problem into 
simple linear correlation, 379- 
380 

parabolic, 383-388 
curve fitted directly, 384 
Doolittle method for solving 
three equations, 384-388
work sheet (table), 386 
practical estimates based on 
equation derived, 388-390
standard error of estimate, calculated, 390-394
reciprocal, 381-383 
illustrated (graph), 381 
practical estimates based on 
equation derived, 388-390
standard error of estimate, cal¬ 
culated, 390-394 
transformation to simple linear 
correlation, 351-353 
standard error of estimate, 390- 
394 

calculated, 391-393 
differences for three types of
regression, 393-394 
first-order standard deviation 
used as, 390-391 
summarized with practical esti¬ 
mates (table), 393 
use of Pearsonian coefficient of
correlation, in logarithmic 
approach, 379 
in reciprocal approach, 382 


Cycle determination, 637-650 
in annual data, 637-642 
annual trend analysis (table
and graph), 640
cyclical movements shown 
(table), 641 

major cycle and cycle with 
residuals (graph), 642 
danger in extrapolating trends, 
648 

major cycle, 641-642 
method of ratios vs. method of 
differences, 648-650 
in monthly data, adjustment re¬ 
quired, 637, 642-644 
danger in extrapolating trends, 
648 

method of determining cycle 
illustrated, 644-647 
where trend is a second- or third- 
degree polynomial, 647-648 
work sheet (table), 643 
Cycles, 545, 637-650 
analysis by empirical trends, 594- 
598 

ogive-like, 552 

D

Data, cumulative vs. noncumulative, 
127-128 

gathering of, 24-51

construction of questionnaires 
or schedules, 30-42 
rational basis for, 28-29 
sampling, 42-49 
units of description and meas¬ 
urement, 28-48 
(See also Questionnaires;
Schedules)

sources of (see Sources of statistical data)

three types of statistical, 4-6 
De Moivre, A., 242 
Density function, 489 
in description of multivariate
distributions, 488-491
Determination of normality (see
Normality)

Dewey, John, 22 

Directory of Federal Statistical
Agencies, 64 

Distribution, of frequency (see Frequency distributions)
of probability (see Probability distributions)

symmetrical binomial (see Symmetrical binomial distribution)

Domesday Book, 25 
Doolittle work sheet for curvilinear 
correlation, 384, 386-388 
Doolittle work sheet for curvilinear 
regression, work sheet (table),
386 

E 

Eddington, Sir Arthur Stanley, 
18-19, 21-22 
Einstein, Albert, 21 
Empirical trends, 582-598 
analysis of cycles by, 594-598 
straight-line and third-degree 
trends with raw data, illustrated (graph), 597
conclusions from trends de¬ 
rived, 598 

work sheet for trend and index 
of normal (table), 594 
work sheet for trend values,
method of finite differences 
(table), 597 

finite differences method for trend 
values, 589-594 

aid for computing finite differences
at t = 0 (table), 590
building up a polynomial 
(table), 589 

danger of cumulative error, 
593-594 

maximum cumulated errors 
(table), 593 

work sheet (table), 592 
polynomial, 583-594 


Empirical trends, polynomial, econ¬ 
omy of calculation in work 
sheet, 583-589 

economical work sheet, algebraic illustration (table), 585

economical work sheet, arithmetical illustration (table), 586

work sheet for second-degree 
polynomial (table), 588 
straight-line trend, 582-583 
work sheet for index of normal 
and trend, 582 

Enumeration, districts, U.S. census, 
35 

problems of, 28-42 
Enumerators, directions to, 35-39, 
42, 44 

training of, 30, 42, 44 
typical problems facing, 29 
Equiprobability, ellipsoids of, 489- 
490 

Error curve, 294-295 
Error, standard, of estimate, 383 
for statistics where sampling 
distribution approximates 
normal curve (table), 320 
Estadistica, 90
Estimates, manufacturers, 7 
Euler, L., 242 

Extrapolation of trends, danger in, 
648 

F 

Federal Reserve Bank of New York, 
517 

Monthly Review of Credit and 
Business Conditions, 534 
Federal Reserve Board, 500, 532 
index, 533 

Federal Reserve Bulletin, 86, 531-532 
Federal Reserve System, 512 
Federal statistical agencies, 71-84 
(See also Sources of statistical
data) 

Financial statistics, sources of (see
Sources of statistical data,
financial)
Finite differences, method of finding
trend values, 589-594
danger of cumulative error, 
593-594 

First-order standard deviation, de¬ 
finition, 390 
Fisher, Irving, 518, 530 
Forecasting, 651-680 
agencies, 661-662 
Babson, 661, 671-672 
Brookmire Economic Society, 
661, 671-672 

Harvard Economic Society, 
661, 671-673 

Moody's Investor's Service, 
661, 671-672 

Standard & Poor's Corporation, 
661, 671-672 

ancient origin of pseudo-scientific, 
651 

combined seasonal and cyclical, 
677-678 

illustrations of, 678 
commercial uses of, 661-662 
cycles with time series, 661-675 
general business conditions, 
663-673 

business barometer, 664, 666, 
670 

combination indexes, 666- 
667, 670 

crosscut analysis method, 
664-665, 671-673
historical analogy method, 
663-671 

indexes of national-income, 
667-670 

indexes of physical volume of 
production (Babson index), 
670-671 

lead-lag difficulties in fore¬ 
casting, 670 

types of indexes, 664-673 
particular lines of activity, 673- 
675 

crosscut analysis method, 675 
crude historical analogy 
method, illustrated, 673 


Forecasting, cycles with time series, 
particular lines of activity, 
cycle hypothesis for, 674-675
lead-lag relationships, 674 
from distribution studies, 656-661 
bivariate distributions, 657-658 
errors of forecasts, 659 
monovariate distributions, 656- 
657 

multivariate distributions, 658- 
659 

modern scientific, 652-656 
conditional, 653-654 
illustrations of, 654-656
popular dramatization of fore¬ 
casts, 652-653 

qualitative vs. quantitative, 654 
use of statistics in, 656 
quality and effect of economic 
forecasting, 679-680 
with seasonal variation, 675-678 
historical analogy, 676-677 
trends with time series, less exact 
forecasting, 660-661 
more exact forecasting, 659-660 
Foreign trade statistics, sources of, 77 
Fourier's theorem, 561 
Fréchet, Maurice, 250
Frequency concept of probability, 
247-249 

Frequency curves, definition, 163-164
derivation from histograms, 162-164

formulas for, 263-267 
normal (see Normal frequency
curve) 

uses of, 164-166 

in graduating observed data, 164-165

as a norm, 165 

in sampling analysis, 165-166 
Frequency distribution analysis, 
numerical computation, 

231 

arithmetic mean, rule for, 214 
averages and variability, diffi¬ 
culties in locating median and 
mode, 216-217
Frequency distribution analysis,
numerical computation, beta
coefficients, 216
calculations, 216-227 

average deviation, 220-224 
histogram assumption in 
grouped data, 221-224 
mid-value assumption in 
grouped data, 220-221 
averages and variability, diffi¬ 
culties in locating median 
and mode, 216-217 
coefficients, of skewness, 226- 
227 

of variability, 225-226 
measures of skewness, 225 
median and quartiles, 218-220 
mode, 217-218 
semiquartile range, 224-225 
construction of class interval, 
190-206 

determining the class inter¬ 
val, effect of too many 
intervals, 202 

interval size chosen to reveal
character of variation, 202-207

illustrative material, distribu¬ 
tion with various class inter¬ 
vals (tables), 203-205 
mean square deviation, 215 
moments about the arbitrary 
origin, 207, 211-212 
moments about the arithmetic 
mean, 212-214 

scatter diagram and graph, 201 
standard deviation, 215-216 
with unequal class intervals, 
228-230 

variability and skewness, graph¬ 
ic interpretation of, 227-228 
work sheet, 206-216 
Charlier check for, 209 
entering the distribution, 208 
illustrated (table), 231!? 
saving calculation, by obtain¬ 
ing moments about an 
arbitrary origin, 207


Frequency distribution analysis,
numerical computation, work
sheet, saving calculation,
in use of work sheet, 208-209

by using class-interval units, 
207-208 

theory, 158-198 
averages, 167-180 

arithmetic mean, 167-170 
concept of, as summary fig¬ 
ures, 179-180 
geometric mean, 173-176 
harmonic mean, 176-179 
median, definition, 170-172 
theory, mode, definition, 172- 
173 

basic formulas used in, sum¬ 
mary of, 198 

beta coefficients, 192-193, 195 
bivariate (see Bivariate fre¬ 
quency distribution) 
charts of, 162-164 
histograms, 162-163 
area histograms, 162 
relative frequencies in, 162- 
163 

determination of normality of, 
297-306 

frequency curves (see Frequency 
curves) 

kurtosis of, 193-196 
measurements of summarization 
and comparison, 166-182 
measures of variability, average 
deviation, 183-184 
quartiles of, 171-172 
range, 182-183 
standard deviation, 184-185 
variance, 166, 185 
moments of, 180-182 
the centroid, 181 
moment coefficient, 180 
purpose of, 181-182 
multivariate (see Multivariate 
frequency distribution) 
of populations, 166-167 
parameters of, 167
Frequency distribution analysis,
theory, possible types of comparison, 159-162
as probability distributions, 254 
of sample data, 167 
statistics of, 167 
sampling distributions (see Sampling distributions)
skewness of, 185-193
symbols used in, summary, 197
trivariate (see Multivariate frequency distribution, trivariate)
use of, 158-159 

Frequency distributions, 140-149 
conventional manner of graphing, 
143-144 

discrete vs. continuous, 142-143 
irrational, 156 

nature and illustration of, 140-142 
rational, 155-157 

Frequency polygon, relative slope at 
a given point computed (graph), 
288 

Frequency series, 138-143 
definition, 138-139 

(See also Frequency distribution) 

Frequency surfaces, 469-492 
bivariate, 471-486 
bivariate histogram, 469-471 
multivariate (density function), 
488-491 

Frequency table, 143 

Functions, compound-interest, 261- 
262 

discount, 263 
explicit, 255-256 
exponential, declining, 262-263 
rising, 261-262 

functional relationships, 255-257 
hyperbolic (table), 695-696
implicit, 255-256, 259-260
joint, 255 
linear, 256 
nonlinear, 256-257 
simple, 257-263 

(See also Simple functions, 
graphs of) 


G 

Galton, Sir Francis, 14, 196, 293, 
323-324 

Gaussian error curve, 194, 294-295 
Geological Survey, 533 
Gompertz logistic curve, 554 
Goodness of fit, described, 287 
illustrated (graph), 288 
test of, 300-305 

[See also Chi square (χ²) test
of goodness of fit] 

Government Publications and Their
Use, 65

Graphs, (see also Charts) 
of simple functions, 257-267 

(See also Simple functions, 
graphs of) 

Graunt, John, 65-66 
Growth, curves, 149 

(See also Rational trends) 
explanation of, 553 
Guides to sources, 62-65 
governmental, 64-65 

Directory of Federal Statistical 
Agencies, 64 

Government Publications and 
Their Use, 65 

U.S. Government Manual, 64 
U.S. Government Publications, 
65 

nongovernmental statistics, 62-64 
handbooks and general index 
material, 63-64 
magazine indexes, 62-63 
Guilds, early sources of statistics, 
25 

H 

Handbooks, 57 

Harvard College Observatory, 13 
Harvard index, 665-666 
Heisenberg, W., 19-20 
Heisenberg’s uncertainty measure¬ 
ment, 19-20 

Histograms, in frequency distributions, 162-163
Hollerith, Herman, 55

Hollerith tabulating machines, 55, 
72-73 

Huygens, C., 242 

Hyperbolic functions (table), 695- 
696 

Hyperplane of regression, 490-491 
I 

Index to Business Indexes, An, 63

Index chart, capital formation, 669 
consumer spending, 668 
Harvard, 665 

Indexes, adjustment to bench marks, 
535-542

ideal conditions for stratified 
sampling absent, 535-536 
method of adjustment illus¬ 
trated, 538-542 

monthly indexes adjusted to 
census figures (table), 540- 
541 

reasons for adjustment, 536-538 
of correlation (see Correlation
index) 

of general business conditions, 
533-535 

Harvard, 665-666 
method of computation illus¬ 
trated, 528 

price indexes, aggregative, using 
given-year weights, 529 
of production, 82, 530-531 
quantity indexes, and business 
barometers, 530-535 
stratified sampling in, 530 
of trade and production, 530- 
531 

computation of weights illus¬ 
trated, 532-533 
weighted by prices, 529-530 
relative series from time series 
(chart), 130 

U.S. Bureau of Labor Statistics, 
construction of indexes, 525- 
527 


Index numbers, 497-542 
application of sampling technique, 
513-514 

stratified sampling, 515-516 
composite, 512-513 
stratified sampling in construc¬ 
tion of, 513 

construction of, aggregative 
method, simple, 522-524 
weighted, 524-525 
average-of-relatives method, 
simple, 518-520 
weighted, 520-522 
methods in general, 518 
conversion of absolute to relative 
numbers, 500-511 
absolutes, 500-501 
relative parts of a whole, 509-511

relatives, 501-503 
relatives using a base period in 
time series, 503-505 
presumption of normality in 
base selected, 505-509 
history of discovery and use, 497- 
500 

simple, great variety in use, 511-512

variety of purposes of, 516-518 
Industrial statistics, sources of, 66, 
77, 83 

Bureau of Manufactures, 77 
The Economic Almanac, 67 
Industrial Commission, 83 
I.Q., 10

International Statistical Yearbook, 85 
Intuitive-axiomatic approach to 
probability, 250-251 
Iron Age, 68 

J 

Journal of the American Statistical 
Association, 534 

K 

King, W. I., 516 
Kolmogoroff, A., 250
Kurtosis, 162, 193-196
Kuznets, Simon S., law of growth, 
explanation of, 554-558 

L 

Labor statistics, sources of, 61, 74-75, 84

Commission on Industrial Rela¬ 
tions, 84 

Department of Labor, 61 

Bureau of Labor Statistics, 78, 86

Monthly Labor Review, 78 
Lagrange, J. L., 242 
Lambert, J. H., 242 
Laplace, P. S., 242-243, 250 
Law of large numbers, 239-240 
League of Nations, 88-91 
indexes, 512 
publications, 512 

Least squares, method of, to find 
line of regression in bivariate 
frequency distribution, 331-334 
Legendre, A. M., 242 
Line of regression, 329-335 
becomes hyperplane of regression 
in multivariate distribution, 
490-491 

derived by method of least 
squares, 331-334 
interpretation of, 336-338 
means of rows and columns 
(table), 330 

relationship to r, 349-351 
standard deviation about means 
or line of regression, 336-338 
standard deviations for columns 
of data given (table), 337 
of X₁ on X₂, 330-334
illustrative diagram, 332
of X₂ on X₁, 335
illustrative diagram, 334 
Linear plane of regression, second- 
order variances for, 413-416 
Lines of regression, in bivariate frequency
distribution, calculated
from given data, after computing r, 362-363


Lines of regression, work of compu¬ 
tation in fitting to time series 
when more than two coeffi¬ 
cients, 599 

Loci of equiprobability, in multivariate
frequency “surface,” 489
Logarithmic charts (see Ratio charts) 
Logarithmic regression, 377-380 
Logarithms, of numbers, four-place 
common (table), 681-684 
scale for ratio charts, 131-137 
Logistic growth curves, (see Rational 
trends) 

M 

Magazine indexes, 62-63 
Market Research Series, 63 
Maximum likelihood, method for 
single best estimate of popula¬ 
tion percentage in sampling, 
314-315 

Means, progressions of, graph, 328- 
329 

Measurement of General Exchange- 
Value, The, 500 
Minerals Yearbook, 58 
Mises, Richard von, 245-251, 269 
Mitchell, Wesley C., 500, 514, 530, 
553 

Business Cycles—The Problem and 
Its Setting, 499, 513, 535 
law of growth, explanation of, 553 
Monthly Bulletin of Statistics, 85 
Monthly Labor Review, 78 
Multiple correlation, 397-436 
analysis of variance in, 422-429 
coefficient of direct determina¬ 
tion, 424 

illustrated (diagram), 424 
coefficient of joint determina¬ 
tion, 425 

coefficient of multiple correlation, 426-428

coefficient of net regression, 424-425

r beta cross-product term, 424-435
Multiple correlation, analysis of variance in,
r beta cross-product
term, illustrated (diagram),
424

residual variance, illustrated 
(diagram), 424 

analysis of variance and causal 
relationships, 428-429 
coefficient of, 398, 416-418 
definition, 397-399 
extended to any number of vari¬ 
ables, 434-436 

extension of formulas, high- 
order variances, 436 
multiple-correlation formulas, 
436 

partial-correlation formulas, 
436 

statistics for regression 
planes, 435-436 
general approaches, 434-436
extension of analysis to four 
variables, 429-434 
multiple correlation coefficient, 
434 

partial correlation in four-vari¬ 
able case, 433-434 
in terms of correlation statistics 
of same order, 432-433 
in terms of lower-order correla¬ 
tion statistics of same kind, 
431-432 

in terms of lower-order r's and
σ's, 432

third-order variance, 434 
linear vs. nonlinear relationships, 
399-400 

notation used in, 401-404 

meaning of subscripts before 
and after point, 402 
symmetry of, 404 
partial correlation (see Partial
correlation) 

Multiple linear regression, 410-416 
beta form of regression equation, 
410-413 

obtained by method of least 
squares, 410


Multiple linear regression, beta form 
of regression equation, a's and
b's calculated from beta form,
412-413 

second-order variances for linear 
plane of regression, 413-416 
Multivariate frequency distribution, 
404-410

analysis, illustrated, 437-468 
trivariate statistics, interpreta¬ 
tion of results, illustrated, 
analysis of variance in X, 
451, 455 

estimates based on regression 
equation, 450-451 
partial-correlation coefficients, 451

best approaches in studying, 409-410

calculation of trivariate statistics, 
444-450 

all-round check on, 450 
equations of three planes of 
regression, as found, 448-449 
first-order correlation statistics, 
445-450 
a statistics, 448 
b statistics from the beta's, 448
coefficients of partial correla¬ 
tion, 447-448 

first-order beta’s from zero- 
order r’s (table), 446 
interpretation of results, illus¬ 
trated, 450-451, 455 
multiple-correlation coefficients, 
449-450 

second-order standard devia¬ 
tions, 449 

zero-order correlation statistics, 
444-445 

examination of, 437, 442-444 
by testing net regression, 437, 
442-443 

trivariate, 405-410 

conditions for independence of
all variables, 408-409
illustrated (diagram), 405
Multivariate frequency distribution,
trivariate, studied by breaking
up into bivariate distributions,
406-408

trivariate analysis illustrated, correlation tables of, 406-407
correlation table of, X₁ and X₂,
406

X₂ and X₃, 406
X₁ and X₃, 407

Multivariate frequency surface, non¬ 
normal distribution, 491-492 
normal, 488-491 
deviations normally distributed, 
490 

effect of greater correlation on 
ellipsoid shape, 490 
ellipsoids of equiprobability, 
489-491 

illustrated (graph), 490 
in reality a density function, 489 

N 

National Bureau of Economic Re¬ 
search, 67 

National Industrial Conference 
Board, 66 

National Research Planning Board, 
publications, 84 

National Resources Committee, 48 
New Jersey State Labor Dept., 538 
New York Times, The, 534
Newtonian mechanics, 19-21 
Neyman, N., 236, 250 
Nonlinear correlation, 365-396 

{See also Curvilinear regression) 
Nonnormality, in bivariate or multi¬ 
variate distributions, 491-492 
of population in sampling, 316 
Normal frequency curve, 232-320 
algebraic and graphic representa¬ 
tion of, 263-267 
algebraic formula, 264 
graph, 263 

graphs of curves with different 
means and same standard 
deviations, 265


Normal frequency curve, algebraic 
and graphic representation of, 
graphs of curves with same 
mean and different standard 
deviations, 266 
areas under (table), 693 
fitted to histogram of given data 
(graph), 295 

method of fitting to sample histogram, 299-300
ordinates of (table), 694 
real life conditions producing, 
290-297 

recurrence in statistical analysis, 
264 

standard normal curve (graph), 
267 

and symmetrical binomial distri¬ 
bution, 279-306 
use in theory of sampling, 264 
useful approximation to binomial 
distribution where N is large, 
308 

Normal frequency surface, 469-496
(See also Bivariate frequency
surface; Multivariate fre¬ 
quency surface) 

Normal probability curve, (see Nor¬ 
mal frequency curve) 

Normality, determination of, in bi¬ 
nomial distribution, 297-306 
by comparison of special statis¬ 
tics, 305-306 

in frequency distributions, 297-306

by graphic comparison, 298-300 
fitting normal curve to sample
histogram, method of,
299-300

by test for goodness of fit (see 
Goodness of fit, test of) 
in time-series, indexes, 505-509 
of population, in sampling, 316- 
317 

O 

Order, of correlation coefficients, 422 
of correlation statistics, 422
Order, designation in correlation statistics,
indicates combination of
variables, 422
of regression statistics, 422 
of standard deviations, 422 
of variance, 363-364, 414-416 
Orthogonal-polynomial trends, 599-616

calculation of coefficients A, B, 
C, . . . , 606-607 
by subtotal summation type of 
work sheet, 608-612 
orthogonal polynomials, defini¬ 
tion, 600-601 

forms used in fitting trends, 
603-606 

tables to save calculation, 612-615 
values of specified variables, 
dependent on number of 
years (tables), 613-615 
trend line by method of least 
squares, 601-603 
uses in trend analysis, 599-600 

P 

Parabolic regression, 383-388 
Parameter, definition of, 167 
Partial correlation, coefficient of, 
418-422 

calculation, 420-422 
definition, 400-401 
notation used in, 401-404 
obtained between two variables
by holding third variable con¬ 
stant, 419-420 
Pascal, B., 242 

Pearl-Reed population curve, 549 
Pearson, Karl, 66, 293-294, 323- 
325, 339 

Pearson-Galton apparatus for bi¬ 
nomial distribution, 293-294 
illustrated, 293 

Pearsonian coefficient of correlation, 
338-349

arithmetic view of r, 339-347 
Pepin the Short, 24 
Percentage, population percentage, 
313-315 


Periodogram, 561-562 
Permutations, defined and illustrated, 232-233

Persons, Warren M., 66, 530, 623 
Petty, Sir William, 65-66 
Pictograms, 102-103 
Planck, Max, 19 
Playfair, William, 100-101
Polling agencies, 6 
Polynomials, definition, 256 
first-degree, graph of, 257-258 
implicit, 255, 259-260 
second-degree, graph of, 258-259 
Population, curves, 548-549 
laws of growth, 549-550 
technical term in frequency dis¬ 
tribution, 166 
theories, early, 549-550 
Prescott, Raymond B., 554 
Presentation of statistics, 92-121 
cartograms, 112-121 

(See also Cartograms)
charts, 100-121

(See also Charts)
tables, 92-100 

Probability, combinations, 233-236 
concepts of, 236 
classical, 242-243 
criticism of classical concept, 
243-247 

meaning of “equally likely,” 244

principle of indifference, 244-245

principle of sufficient reason,
245-246

subjective character of, 246- 
247 

frequency concept, 247 

criticism of von Mises’ 
theory, 247-250 
intuitive-axiomatic approach, 
250-251 

curve, formulas for, 263-267 

(See also Frequency curves;
Normal frequency curve) 
definition, 237 
dependent, 271-272
Probability, empirically determined,
240-241

independent, 270-271 
independent vs. dependent illus¬ 
trated in real life case, 273-274 
law of large numbers, 239-240 
permutations, 232-233 
of possible combinations of 10 
coins (table), 282 
randomness, 241-242 
and relative frequency of actual 
events, 239-242 

Probability calculus, 268-278 
addition theorem, 268-269 
for dependent probabilities, 271-272

examples of calculation, for dis¬ 
crete distributions, 274-276 
for continuous distribution, 
276-278 

independence vs. dependence il¬ 
lustrated in real life case, 
273-274 

for independent probability, 270- 
271 

multiplication theorem, 269-274 
statement, 269 

Probability distributions, 252-267 
continuous, 253-254 
discrete, 253 

functional relationships in (see 
Functions) 

identical with certain types of 
frequency distribution, 254 
probability curve, 254 

(See also Normal frequency 
curve) 

Probability sets, calculation of “derived”
or “second-order” sets,
269ff.

finite, multiplication theorem valid
for, 272 

fundamental, 237-238 
infinite, 238-239 
multiplication theorem valid 
for, 272 

Problem of Estimation, The, 600


Product deviation, measurement of, 
339-347 

Product-moment coefficient of correlation, 339ff.

Product-moment formula for r, use 
in nonnormal frequency distri¬ 
butions, 491-492 
Product term, definition, 485 

disappears where correlation is 
absent, 485 

Public opinion, sampling of, 6 
Publications, statistical (see Statisti¬ 
cal publications) 

Q 

Quality control, 18, 248 
Quantum theory, 19-21 
Quartiles, calculation of, 218-220 
definition, 171-172 
interpretation, 227-228 
use in measuring skewness, 189- 
191 

Questionnaires, mailed, 48-49 

good-will letter used in support 
of (typical form), 50 
rules for constructing, 49, 51 
(See also Schedules) 

Quetelet, A., 87, 499, 513, 549-550 

R 

Ratio charts, 131-137 

advantages and disadvantages of, 
135-137

paper used for, 133-135 
relative growth shown on, 131, 
136-137 

three scales of paper used for, 
134-135 

value for comparisons impossible 
on arithmetic paper, 136-137 
Ratio scale (see Semilogarithmic 
paper) 

Rational trends, 547-558, 574-581 
dying institution, illustrated, 575-577

possible trends in dying institution, 574
Rational trends, dying institution,
illustrated, trend fitted by
method of least squares
(graph), 577

work sheet for annual index 
of normal and trend 
(table), 576 

growing institution, illustrated, 
578-581

curve fitted by method of 
selected points (graph), 581 
method of selected points, 
578-581 

work sheet for index of nor¬ 
mal and trend (table), 580 
Reciprocal regression, 381-383 
Reciprocals of numbers (table), 691- 
692 

Regression, linear plane of, 397-399, 
410-413 

(See also Linear plane of 
regression) 
logarithmic, 377-380 
multiple linear, 410-416 
parabolic, 383-388 
reciprocal, 381-383
statistics, order in, 422 
Relative frequency, probability, 280 
Relativity theory, 21 
Research associations, 66-68 
Review of Economic Statistics, 66, 
500, 530 

S 

Sampling, 42-48 

by Bureau of Labor Statistics, 
42-48 

fitting of normal curve to sample 
histogram, 299-300 
in government study of family 
income and expenditures, 48 
of means, 315-319 

population mean, confidence 
limits for, 318 
estimate of, 318 
testing a hypothesis about, 
317-318 

sampling distribution, 315-316 


Sampling, testing a hypothesis, 317- 
318 

in 1940 U.S. Census, 42
of percentages, 307-315 
coefficient of risk, 310-311 
confidence coefficient, 311-312 
confidence interval, 313 
population percentage, deter¬ 
mining confidence limits 
for, 311-313 

likelihood of, defined, 315 
likelihood of, relation to
probability of sample (dia¬ 
gram), 314 

maximum likelihood, estimate 
of, 313-315 

testing hypothesis about, 
309-311 

sampling distribution, 307-309
statistical inferences from, 309-315

types of inference, 309 
typical problem, 307
random, 241-242, 307 
relative frequency of samples 
follows binomial distribution 
pattern, 308 

sampling distributions, (see Sam¬ 
pling distributions) 
standard errors for selected statis¬ 
tics, where distribution ap¬ 
proximates normal curve 
(table), 320 

stratified, in construction of index 
numbers, 515-516 
use of normal frequency curve in, 
307-320 

conclusions as to, 319-320 
used in business, 9 
of variances, population variance, 316

confidence limits for, 319 
optimum estimate of, 319 
testing a hypothesis about, 
318-319 

sampling distribution, 315-316 
standard deviation, the, 317 
Sampling distributions, 307-317
explained, 307-308 
of sample means, 315-316 
of sample percentages, 307-308 
of sample variance, 316-317 
Schedules, coding of, 53-55 
editing of, 52-53 

mailed questionnaires (see Ques¬ 
tionnaires, mailed) 
problems of enumeration, 28-42 
questionnaires (see also Questionnaires)

tabulation of, 55 

(See also Tables; Tabulation)
units of description and measure¬ 
ment, 28-34 

illustrations of government care 
in, 35-48 

Seasonal variation, 617-636 
causes of, 618-621 
historical background of study, 
617-618 

in labor, McCabe, 620 
measurement illustrated, 625-633 
calculation by 12 months’ mov¬ 
ing average method, 625, 
630-633 

completed index (table), 631 
multiple frequency array, deter¬ 
minations from, 633 
illustrated (graphs), 631-632
work sheet for calculating index 
(tables), 626-630 

method of detecting change in, 
633-636 

computation of index for single 
year (table), 636 
index required for each year 
because of observable trend, 
636 

trends in seasonal variation 
illustrated (graphs), 634-635 
methods of measuring, 621-625 
Kemmerer, 623 
Persons, 623 

problem of isolating, 621-625 
link relative method, 623 


Seasonal variation, problem of iso¬ 
lating, ratio-difference-from- 
trend method, 624 
twelve months' moving average
method, 624-625
various suggested methods, 
bibliography for, 624n 
testing whether well defined, 632 
trend in, 634-635 

Second-order, indicates statistic with 
two figures to right of decimal 
in subscript, 422

Semilogarithmic paper, 133-135 
Series, bivariate, 149-154 
frequency, 139-149 
Sheppard’s correction, 299 
Shewhart, W. A., 248 
Significant figures, meaning of, 230- 
231 

Simple correlation, 351-353 
lines of regression, calculated, 
362-363 

Simple functions, graphs of, 257-267 
circle, 259-260 
ellipse, 260 

exponential function, declining, 
262-263 
rising, 261-262 

first-degree polynomials, 257-258
normal frequency curve, 263-267
second-degree polynomials, 258-259

Simpson, C. G., 14, 242 
Single best estimate, of population 
percentage in sampling, 314-315 
Skewness, definition and significance 
of, 185-193 

measurement of, by beta coeffi¬ 
cients, 192-193 

by medians and quartiles, 189- 
191 

by relation of mean, median, 
and mode, 185-189 
by third moment, 191-192 
Smith, Adam, 11-12 
Smithsonian Institution, 13 
Social Science Research Council, 48 
Social Security Administration, 537
Sources of statistical data, 56-91 
(See also Guides to sources of
statistics) 

agricultural, (see Agricultural statistics)

banking, (see Banking statistics,
sources of)
commercial, 68-71
commercial and financial publi¬ 
cations, 70-71 
trade associations, 69-70 
trade journals, 68-69 
federal, 71-84 

Congressional investigations, 82 
financial, 70 

commercial and financial publi¬ 
cations, 70-71 

general summary, developing pat¬ 
tern of sources, 59-62 
for social sciences, 57-58 
guides to (see Guides to sources)
industrial, (see Industrial statistics, sources of)
international, 91 

on labor, (see Labor statistics,
sources of) 

pattern of existing (outline), 61-62 
primary vs. secondary, 56 
private research, individuals, 61, 
65-66 

handbooks on, 63 
research associations, 66-68 
(See also Statistical publications)

state and municipal, 84-85 
on trade, (see Trade)
on transportation and communication, (see Transportation
and communication statistics)

world statistics, 85-91 
best sources of, 91 
Split-bar charts, 110-111 
Square roots of numbers, 100-1000
table, 689-690 
Squares, 100-990
of numbers (table), 685-686 


Standard deviation, first-order, 338 
from lines of regression, 336-338
zero-order, 338 

Standard error, of variance, 316 
of estimate, 338 
definition, 390 

Standard Industrial Classification
Code, 54 

Statesman's Yearbook, 86 
Statistic, definition of a, 167 
Statistical Abstract of the United
States, 64

Statistical Atlas, 73
Statistical data (see Data)
gathering of (see Data, gathering
of)

Statistical laws, 19 
Statistical publications, abstracting 
agencies, 57n 
world statistics, 85-91 

(See also Guides to sources)
Statistics, accuracy in calculating, 
230-231 

in the arts and sciences, 1-23 
in astronomy, 12-13 
in biology, 14-16
in business administration, 7-9 
definition and meaning of, 1-4 
descriptive, 232

in economic theory, 11-12 
in education, 9-11 
in engineering, 16-18 
forecasting by means of, 651-680 
gathering of, 24-55 

historical development in, 24-28 
(See also Data, gathering of)
in governmental administration,
6-7

in medicine, 16 
and philosophy, 21-22
in physics and chemistry, 18-21
in politics, 6 

presentation of (see Presentation
of statistics) 
in sociology, 11 

sources of (see Sources of statistical data)
Statistics, summarization and com¬
parison by means of fre¬ 
quency distributions, 158- 
198 

by means of index numbers, 
497-542 

measurements for, 166-182 
symbols used in (see Symbols used
in statistics) 

theoretical, definition, 232 

by use of index numbers, 497^. 
in zoology, 13-14

Summarization and comparison, in 
bivariate frequency distributions, 327-353
measurements of, 166-182 

Survey of Current Business, 77, 512, 
525, 531, 534 

Symbols used in statistics, 122-129 
basic symbols, 122-124 
multiple and partial correlation, 
401-404 

time series, 124-126 
passage of time, 124-125 
units involved, 126-127 
where variable fluctuates with 
time, 125-126 

Symmetrical binomial distribution, 
character of, 283-285
mean, 283-284 
moments, 285 
symmetry, 283 
variance, 284-285 
derivation, 280-283 
graph of, 284 

and the normal curve, 279-306 
beta values approach those of 
normal curve, 289 
distribution approaches normal 
curve as limit, 285-290
graphic comparison, 288
relative slope of frequency polygon
and normal curve compared, 289

real life conditions producing, 290-297

summary, 295-297 


Symmetrical binomial distribution, 
relative slope of frequency poly¬ 
gon computed for a given point 
(graph), 288 

for two values of N (graph), 286 
effect of scale adjustments 
(graph), 287 

seen in relative frequencies, 
282, 291-293 

Symmetry, in frequency surfaces, 
486-487 

in notation for multiple and par¬ 
tial correlation, 404 
writing of equations by, illus¬ 
trated, 431-432 

T 

Tables, 92-100 
construction of, 92-93, 95-96 
general-purpose, 93-94 
special-purpose, 93, 95-97, 100 
types of, illustrated, 94-99 
Tabulation, machine, 55, 72-73 
mechanics of, 73 
principles of, 92 
(See also Tables)

Test of goodness of fit, 300-305 

(See also Chi square (χ²) test of
goodness of fit)

Theorem, Fourier's, 561
Theory of errors, 294-296 
not intended to mask inaccuracy 
of calculation, 230 
Theory of relativity (see Relativity
theory) 

Third-order statistics, 422 
Time series, analysis of (see Cycle
determination; Seasonal varia¬ 
tion; Trend analysis) 
careful description of units in¬ 
volved, 126 

conventional charting of, 128-129 
cumulative vs. noncumulative 
data, 127-128 

elements of variation in, 543-547 
cycle, 544-547 

long-term growth or trend, 543-
547
Time series, elements of variation
in, residual fluctuations, 574 
seasonal variations, 543-547 
hypothetical, showing elements of 
variation (table), 544 
rational basis of analysis of, 543- 
563 

rational trends, 547-558, 564 
Time-series analysis, development of 
technique for, 560-563 
harmonic (periodogram) analy¬ 
sis, 561-562 

major cycle determination by 
Kuznets’ methods, 561 
ordinary and minor cycle deter¬ 
mination by empirical 
methods, 561-562 
use of functions of arc tangent, 
562 

use of orthogonal polynomials, 
562, 599-616 

empirical trends, 558-560 

application to cycle analysis, 
558-560 

empirical vs. rational trends, 564 
rational basis for, 543-563 
rational trends, application to 
social philosophy, 553-558 
basis for rationalizing, 550-552 
criticism of, 552-553
early population theories, 549- 
550 

historical background, 547-548 
population curves, 548-549 
(See also Cycle determination;
Seasonal variation; Trend
analysis)

Trade, Department of Commerce, 86 
Commerce Yearbook, 86 
domestic, 81 

Federal Trade Commission, 81 

foreign, 77, 82, 86 

Bureau of Foreign and Domes¬ 
tic Commerce, 77 
Statistical Abstract of the 
United States, 77 
Survey of Current Business, 77 
Statesman's Yearbook, 86 


Trade, U.S. Tariff Commission, 82 
Transportation and communication 
statistics, 81 

Interstate Commerce Commis¬ 
sion, 81 

Treasury Department, 7 
Treatise on Money, J. M. Keynes, 
517 

Trend analysis, 564-616

detecting cycle by removing em¬ 
pirical trend, 565 
empirical vs. rational trend, 564 
empirical trends, illustrated, 582- 
598 

analysis of cycles by empirical 
trends, 594-598

finite differences method for 
finding trend values, 589-594 
polynomials, 583-594 
straight-line trend, 582-583 
methods of fitting trend, 565-574 
by averages, 573 

moving averages method, 
573-574 

by least squares, 565-573 
advantages of method, 574 
basic method, 565-568 
numerical illustration, 568-569

probability theory not ap¬ 
plied, 570-571 

second- or third-degree 
curves, 569-570 
by selected points, 571-573 
orthogonal-polynomial trends (see
Orthogonal-polynomial trends)

rational trends, illustrated, 574- 
581 

dying institution, 574-577 
growing institution, 578-581 
Trends, empirical (see Empirical 
trends) 

orthogonal-polynomial (see Orthogonal-polynomial trends)

rational (see Rational trends) 
Trivariate frequency distribution,
405-410 

Twentieth Century Fund, The, 68 
U 

U.S. Bureau of Census, 42, 48, 53- 
56, 532 

U.S. Census, 30-38, 532 
development of, 71-76 
U.S. Census of Agriculture, 38-42 
U.S. Department of Agriculture, 12, 
524 (see also Agricultural Sta¬ 
tistics) 

U.S. Department of Commerce, 56, 
504, 512, 524, 531 
Bureau of Census (see U.S. Bureau of Census)

Bureau of Foreign and Domestic 
Commerce, 56, 76-77, 512, 
535 

Survey of Current Business, 525,
531, 534 

U.S. Department of Labor, 525 
U.S. Government Manual, 64
Units, careful description necessary 
in time series, 126-127 
of enumeration, 1 
of measurement, 1-2 
Univariate frequency distribution, 
325 

V 

Van Buren, President Martin, 72 
Variable attributes, 157 
Variable, continuous, illustrated, 
292-293 

discrete, illustrated, 291-294 
but not integral, illustrated, 292 
integral, illustrated, 291-292 
nonintegral, illustrated, 292 
Variable X, in an array, illustrated 
use of, 139-140 
Variability, 182-185
average deviation, 183-184 
range, 182-183 


Variability, standard deviation, 184- 
185 

universal condition facing scien¬ 
tist, 122 

Variance, calculation of, 215 
definition of, 185
first-order, calculated, 363-364
defined, 338 
meaning of, 336-338 
relation to r, 351-352 
proportion measured, by square of 
correlation coefficient, 353 
by square of correlation index, 
396 

by square of correlation ratio, 
373-376 

sampling distribution of, 316 
second order, for linear plane of 
regression, 413-416 
third order, 434 

Variation, frequency series, 138-143 
static, frequency distribution as 
tool for analysis, 158 
static vs. dynamic, 154-155 
Venn, J., 247 

Verhulst growth curve, 550-551 
W 

Walker, Francis A., 72 
Ward, Lester Frank, 11 
William the Conqueror, 25 
Works Progress Administration, 53,
538

World Almanac, 64 
World Economic Survey, 85 
World Peace Foundation, 91 
Wright, Carroll D., 78 

Y 

Yule, G. Udny, 324, 549-551 
Z 

Zero order statistics in correlation, 
444 